Role Description
We are hiring a Senior Site Reliability Engineer (SRE) to own the reliability of our GPU training and inference clusters from the US West Coast. You will serve as the on-call anchor for US hours, drive incident response on multi-thousand-GPU fabrics, and push our platform toward higher availability, faster recovery, and cleaner operations. This is a hands-on role with significant production impact from week one.
Key Responsibilities
Cluster Operations & Hardening
- Production SLURM Management: Operate and harden production SLURM clusters running large-scale distributed training and inference jobs.
- Hardware Health: Own the health of NVIDIA HGX and DGX nodes, including GPU, NVLink, NVSwitch, and BMC diagnostics.
- Fabric Tuning: Debug and tune NVIDIA Quantum InfiniBand fabrics (NDR and HDR), including Subnet Manager, topology, adaptive routing, SHARP, and congestion issues.
- Root Cause Analysis: Drive deep-dive RCA on GPU failures, XID errors, ECC events, thermal throttling, and link flaps (see the triage sketch after this list).
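To give a concrete flavor of the RCA work: below is a minimal Python sketch, assuming kernel-log XID lines in the standard `NVRM: Xid (PCI:...)` format, that counts XID events per GPU and maps a few codes to follow-up actions. The action table is a hypothetical subset for illustration; real triage should follow NVIDIA's XID documentation.

```python
#!/usr/bin/env python3
"""Minimal XID triage sketch: count XID events per GPU from the kernel log."""
import re
import subprocess
from collections import Counter

# Matches lines like: NVRM: Xid (PCI:0000:3b:00.0): 79, pid=1234, ...
XID_PATTERN = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)")

# Hypothetical subset of XID codes; consult NVIDIA's XID documentation.
ACTIONS = {
    "79": "GPU fell off the bus -- drain node, inspect PCIe and power",
    "48": "Double-bit ECC error -- drain node, run DCGM diagnostics",
    "31": "GPU memory page fault -- usually an application bug",
}

def collect_xids() -> Counter:
    """Scan dmesg output (may require root) and tally XIDs per GPU."""
    log = subprocess.run(["dmesg"], capture_output=True, text=True,
                         check=True).stdout
    counts = Counter()
    for line in log.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            counts[match.groups()] += 1
    return counts

if __name__ == "__main__":
    for (pci, xid), n in sorted(collect_xids().items()):
        action = ACTIONS.get(xid, "see NVIDIA XID documentation")
        print(f"{pci}  Xid {xid}  x{n}  -> {action}")
```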
Automation & Observability
- Systems Automation: Write robust automation in Python, Go, or Bash to replace manual work, improve MTTR, and scale operations efficiently.
- Observability Stack: Build and maintain observability for GPU fleets using Prometheus, Grafana, DCGM, node_exporter, and custom exporters (a minimal exporter sketch follows this list).
- Capacity & Rollouts: Contribute to capacity planning, firmware rollout strategy, and cluster bring-up for new sites.
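As an illustration of the custom-exporter work: a minimal Prometheus exporter sketch in Python, assuming the `prometheus_client` and `pynvml` packages are available. The metric names, port, and scrape interval are placeholders, not the series that DCGM-exporter publishes.

```python
#!/usr/bin/env python3
"""Minimal per-GPU Prometheus exporter sketch using NVML."""
import time

import pynvml
from prometheus_client import Gauge, start_http_server

# Placeholder metric names; real fleets typically standardize on DCGM-exporter.
GPU_TEMP = Gauge("gpu_temperature_celsius", "GPU core temperature", ["gpu"])
GPU_UTIL = Gauge("gpu_utilization_percent", "GPU utilization", ["gpu"])

def scrape() -> None:
    """Refresh gauges for every GPU NVML can see on this node."""
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        GPU_TEMP.labels(gpu=str(i)).set(
            pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU))
        GPU_UTIL.labels(gpu=str(i)).set(
            pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)

if __name__ == "__main__":
    pynvml.nvmlInit()
    start_http_server(9400)  # arbitrary example port
    while True:
        scrape()
        time.sleep(15)  # arbitrary example interval
```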
Collaboration & Incident Response
- Workload Optimization: Partner with customer workload teams on NCCL tuning, job scheduling policy, QoS, and fairshare (a sketch of common NCCL tuning knobs follows this list).
- Operational Excellence: Lead post-mortems, write comprehensive runbooks, and improve change management processes across global regions.
- On-Call Leadership: Participate in the on-call rotation for US hours and handle escalations from international sites when necessary.
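To give a sense of the NCCL tuning surface: a hedged sketch of launching a distributed training job with a few widely used NCCL environment variables set. The specific values and the `train.py` entry point are placeholders; appropriate settings depend on fabric topology, NCCL version, and the workload itself.

```python
#!/usr/bin/env python3
"""Sketch: launch a torchrun job with example NCCL tuning knobs set."""
import os
import subprocess

# All values are illustrative, not a recommended production profile.
nccl_env = {
    "NCCL_IB_HCA": "mlx5",              # restrict NCCL to Mellanox IB HCAs
    "NCCL_IB_QPS_PER_CONNECTION": "4",  # example value; workload dependent
    "NCCL_NET_GDR_LEVEL": "SYS",        # allow GPUDirect RDMA system-wide
    "NCCL_DEBUG": "INFO",               # surface ring/tree selection in logs
}

subprocess.run(
    ["torchrun", "--nproc_per_node", "8", "train.py"],  # hypothetical entry point
    env={**os.environ, **nccl_env},
    check=True,
)
```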
Qualifications
- 5+ years in SRE, systems engineering, or HPC operations.
- Extensive production experience with SLURM at scale (accounting/slurmdbd, prolog/epilog scripts, cgroups, GRES, topology awareness); a sketch tying SLURM and GPU-health tooling together follows this list.
- Hands-on experience with NVIDIA datacenter GPUs, driver stacks, CUDA runtime, Fabric Manager, nvidia-smi, DCGM, and GPUDirect RDMA.
- Operational experience with InfiniBand fabrics at 100 Gb/s or higher (OpenSM/UFM, ibdiagnet, perfquery, and fabric troubleshooting).
- Expert-level Linux administration skills (Ubuntu/RHEL family), including kernel tuning, systemd, networking, and PXE provisioning.
- Solid scripting skills in Python and Bash, plus working knowledge of Ansible or Terraform.
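As a small example of where SLURM and GPU-health tooling meet: a node-local health gate that could run from a SLURM prolog or a periodic timer, running a quick DCGM diagnostic and draining the node on failure. The diagnostic level, output parsing, and the assumption that the hostname matches the SLURM node name are all simplifications.

```python
#!/usr/bin/env python3
"""Sketch: drain a SLURM node when a quick DCGM diagnostic fails."""
import socket
import subprocess

def gpus_healthy() -> bool:
    """Run DCGM's level-1 (quick) diagnostic; treat any failure as unhealthy."""
    result = subprocess.run(["dcgmi", "diag", "-r", "1"],
                            capture_output=True, text=True)
    return result.returncode == 0 and "Fail" not in result.stdout

if __name__ == "__main__":
    node = socket.gethostname()  # assumes hostname == SLURM NodeName
    if not gpus_healthy():
        subprocess.run(
            ["scontrol", "update", f"NodeName={node}",
             "State=DRAIN", "Reason=dcgmi_diag_failure"],
            check=True,
        )
```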
Nice to Have
- Experience with NCCL internals, PyTorch distributed, or Megatron-style training stacks.
- Familiarity with NVIDIA Base Command Manager (BCM), Run:ai, or similar cluster managers.
- Experience running Kubernetes on bare metal with the GPU, Network, and MPI Operators.
- Exposure to high-performance storage such as Lustre, WEKA, VAST, or BeeGFS.
- Prior work in an AI cloud, neocloud, HPC center, or hyperscaler environment.
Benefits
- You will operate clusters that train world-class models, working with the most advanced hardware available.
- We maintain a flat structure with direct access to leadership and a culture built around technical craftsmanship and ownership.
- Full remote flexibility with occasional travel for team summits and datacenter site visits.
- Comprehensive US benefits including performance bonuses, equity participation, and 401(k) eligibility.