AWS Site Reliability Engineer (Data Platform)

601525
  • £450 - £455 per day
  • City of London, England
  • Contract


Our client, a prominent organisation in the technology sector, is currently seeking an AWS Site Reliability Engineer (SRE) to support and scale a cloud-native data platform built on AWS, Snowflake, and Databricks. This role focuses on enhancing reliability through automation, disaster recovery testing, resiliency engineering, observability, and proactive SLO/SLI/SLA management.

Key Responsibilities:

  • Design, build, and maintain automation for infrastructure provisioning, platform operations, and incident response using IaC and CI/CD.
  • Lead resiliency and disaster recovery planning, including regular DR drills, failure testing, and recovery validation across AWS and data platform components.
  • Define, implement, and manage SLIs, SLOs, and SLAs for critical data pipelines and platform services; utilise error budgets to guide reliability improvements.
  • Build and operate robust observability solutions (metrics, logs, traces, alerts) for AWS services, Snowflake, and Databricks workloads.
  • Partner with data engineering and platform teams to embed reliability-by-design into architecture and delivery practices.
  • Perform root cause analysis (RCA) and drive continuous improvement to reduce toil and enhance platform availability and performance.
  • Own and drive resolution of incidents and service requests raised by consumer teams, providing operational support and automating fixes to improve reliability and user experience.

Job Requirements:

  • Practical knowledge of SRE principles, including SLO/SLI/SLA design and error budgets.
  • Strong experience with AWS (e.g., EC2, S3, IAM, VPC, CloudWatch) in production environments.
  • Experience with observability tools and monitoring/alerting best practices.
  • Hands-on experience with automation and IaC (Terraform, CloudFormation, CDK) and scripting (Python, Bash).
  • Exposure to data platforms such as Snowflake and/or Databricks.

Nice to Have:

  • Experience running DR tests, chaos engineering, or resiliency testing in cloud environments.
  • Familiarity with CI/CD pipelines and GitOps practices.
  • Background supporting large-scale data or analytics platforms.


If you have expertise in and a passion for cloud-native data platforms, and are ready to take on new challenges in a dynamic contract role, we would love to hear from you. Apply now to join our client's innovative team.

Harry Stayman, Associate Consultant
