Our client, a prominent organisation in the technology sector, is seeking an AWS Site Reliability Engineer (SRE) to support and scale a cloud-native data platform built on AWS, Snowflake, and Databricks. The role focuses on improving reliability through automation, disaster recovery testing, resiliency engineering, observability, and proactive SLI/SLO/SLA management.
Key Responsibilities:
- Design, build, and maintain automation for infrastructure provisioning, platform operations, and incident response using IaC and CI/CD.
- Lead resiliency and disaster recovery planning, including regular DR drills, failure testing, and recovery validation across AWS and data platform components.
- Define, implement, and manage SLIs, SLOs, and SLAs for critical data pipelines and platform services, and use error budgets to prioritise reliability improvements.
- Build and operate robust observability solutions (metrics, logs, traces, alerts) for AWS services, Snowflake, and Databricks workloads.
- Partner with data engineering and platform teams to embed reliability-by-design into architecture and delivery practices.
- Perform root cause analysis (RCA) and drive continuous improvement to reduce toil and enhance platform availability and performance.
- Own and drive resolution of incidents and service requests raised by consumer teams, providing operational support and automating fixes to improve reliability and user experience.
Job Requirements:
- Practical knowledge of SRE principles, including SLI/SLO/SLA design and error budgets.
- Strong experience with AWS (e.g., EC2, S3, IAM, VPC, CloudWatch) in production environments.
- Experience with observability tools and monitoring/alerting best practices.
- Hands-on experience with automation and IaC (e.g., Terraform, CloudFormation, AWS CDK) and scripting (e.g., Python, Bash).
- Exposure to data platforms such as Snowflake and/or Databricks.
Nice to Have:
- Experience running DR tests, chaos engineering, or resiliency testing in cloud environments.
- Familiarity with CI/CD pipelines and GitOps practices.
- Background supporting large-scale data or analytics platforms.
If you have expertise in, and a passion for, cloud-native data platforms and are ready to take on new challenges in a contract role, we would love to hear from you. Apply now to join our client's team.