Role Description
We're looking for a mid-level DevOps / Site Reliability Engineer to own and scale our cloud infrastructure. You'll work closely with engineering and ML teams to keep our systems reliable, observable, and fast, directly supporting the infrastructure that powers AI data pipelines at scale.
- Own cloud infrastructure on AWS: EC2, EKS, RDS, S3, IAM, VPC
- Manage Kubernetes clusters and container orchestration end-to-end
- Build and maintain CI/CD pipelines using GitHub Actions or similar
- Implement monitoring, alerting, and observability stacks (Prometheus, Grafana, or Datadog)
- Improve reliability, performance, and security of production systems
- Automate infrastructure with Terraform or similar IaC tools
- Debug and resolve issues across complex, distributed systems
- Participate in design reviews and help raise the infrastructure bar
Qualifications
- 3-5 years in DevOps, SRE, or infrastructure engineering
- Strong AWS experience: EKS, EC2, RDS, S3, IAM
- Kubernetes: deployment, scaling, and troubleshooting in production
- CI/CD pipelines: GitHub Actions, ArgoCD, or similar
- Infrastructure as Code: Terraform, Pulumi, or CDK
- Python or Go scripting
- Experience working in production environments with real users
- Comfort with ambiguity and ability to operate autonomously
Nice to Have
- Experience supporting ML training workloads or GPU clusters
- Familiarity with distributed computing or large-scale data pipelines
- Prior work at an AI, ML, or data company
- Open-source contributions or published technical writing
Benefits
- Competitive compensation and meaningful equity
- Direct impact on frontier AI model training and evaluation infrastructure
- Flexible, remote-friendly environment with low bureaucracy
- A small, high-caliber team with deep AI research expertise
- Health, wellness, and learning & development benefits