DevOps Engineer - Reinforcement Learning Platforms
We are seeking an experienced DevOps Engineer to help build and scale a web-based platform for reinforcement learning (RL) training and RLOps. You will design, implement, and maintain the cloud infrastructure, CI/CD pipelines, and deployment systems that support large-scale RL workloads.
Responsibilities
* Design and manage scalable cloud infrastructure for high-performance RL training and distributed environments
* Build and optimise CI/CD pipelines for open-source and enterprise components
* Implement containerisation and orchestration using Docker and Kubernetes
* Develop Infrastructure as Code solutions (Terraform, CloudFormation, Pulumi)
* Implement monitoring, logging, and alerting for distributed ML systems
* Collaborate with ML teams on resource optimisation and cost efficiency
* Apply security best practices, manage access controls, and ensure compliance
* Automate operational tasks: backups, disaster recovery, maintenance
* Support GPU clusters and distributed compute resources for RL workloads
* Maintain availability and performance of production ML systems
Requirements
* Degree in Computer Science or Engineering, or 3+ years of DevOps or infrastructure experience
* Strong background with AWS, GCP, or Azure, including ML/AI workloads
* Proficiency with Docker, Kubernetes, and ML-focused orchestration
* Experience with Terraform/CloudFormation/Pulumi and configuration management
* Solid understanding of CI/CD tools (GitHub Actions, GitLab CI, Jenkins)
* Knowledge of monitoring/observability tools (Prometheus, Grafana, OpenObserve)
* Experience with GPU infrastructure and distributed ML compute frameworks
* Familiarity with MLOps tools and model lifecycle management
* Strong scripting skills (Python, Bash)
* Understanding of cloud networking, security, and database fundamentals
* Experience with HPC environments or job schedulers is a plus
* Strong problem-solving and communication skills
Compensation & Benefits
* Stock options
* 30 days' holiday plus bank holidays
* Flexible and remote working options
* Enhanced parental leave
* £500 annual learning and development budget
* Pension scheme
* Regular socials and quarterly gatherings
* Bike-to-Work scheme