Role Description
The selected candidate will join the team responsible for engineering and operating large-scale GPU and compute platforms that power AI/ML and high-performance computing workloads across multiple datacenters. This role focuses on building reliable, scalable GPU platforms and helping internal users successfully run AI/ML and high-performance workloads on Kubernetes and related compute infrastructure.
- Design, implement, and support GPU/Kubernetes clusters and supporting infrastructure
- Support customers running AI/ML training, simulation, and HPC workloads
- Develop automation and tooling for cluster provisioning, configuration management, and platform operations
- Collaborate with application and research teams to optimize workloads running on GPU infrastructure
- Implement monitoring, observability, and performance tuning across GPU and compute platforms
- Troubleshoot infrastructure issues across compute, networking, and container platforms (occasional on-call support)
- Contribute to platform reliability, scalability, and operational best practices
- Produce clear technical documentation and operational runbooks
Qualifications
- 5+ years of Linux systems engineering or infrastructure experience
- 2+ years working with container platforms such as Kubernetes or OpenShift
- Familiarity with Kubernetes GPU scheduling and related tooling
- Familiarity with CI/CD pipelines and platform engineering practices
- Experience operating compute infrastructure for high-performance workloads or large distributed systems
- Strong scripting or programming skills (Python, Bash, or similar)
- Experience building infrastructure automation and operational tooling
- Strong troubleshooting and problem-solving skills across complex infrastructure systems
- Ability to communicate clearly with both platform engineers and application teams
- Demonstrated ability to manage multiple technical initiatives simultaneously
Requirements
- Bachelor's degree in Computer Science, Engineering, or related field, or equivalent experience (Nice to Have)
- Experience with observability platforms such as Prometheus, Grafana, or similar (Nice to Have)
- Experience with infrastructure automation tools (Ansible, Terraform, etc.) (Nice to Have)
- Experience with high-speed networking technologies such as InfiniBand or RDMA (Nice to Have)
Benefits
- Immediate medical, dental, and prescription drug coverage
- Flexible family care, parental leave, new parent ramp-up programs, subsidized back-up child care and more
- Vehicle discount program for employees and family members, and management leases
- Tuition assistance
- Established and active employee resource groups
- Paid time off for individual and team community service
- A generous schedule of paid holidays, including the week between Christmas and New Year's Day
- Paid time off and the option to purchase additional vacation time