Site Reliability Engineer, AI Infrastructure @Somewhere
Devops
Salary unspecified
Employment Type full-time
Posted 3d ago

[Hiring] Site Reliability Engineer, AI Infrastructure @Somewhere

3d ago - Somewhere is hiring a remote Site Reliability Engineer, AI Infrastructure. 💸 Salary: unspecified 📍 Location: UK, Canada, Germany, France, India, Brazil, South Africa, Philippines, Latvia

Role Description

We are hiring a Senior Site Reliability Engineer (SRE) to own the reliability of our GPU training and inference clusters from the US West Coast. You will serve as the on-call anchor for Asian hours, drive incident response on multi-thousand-GPU fabrics, and push our platform toward higher availability, faster recovery, and cleaner operations. This is a hands-on role with significant production impact from week one.

Key Responsibilities

  • Cluster Operations & Hardening
    • Production SLURM Management: Operate and harden production SLURM clusters running large-scale distributed training and inference jobs.
    • Hardware Health: Own the health of NVIDIA HGX and DGX nodes, including GPU, NVLink, NVSwitch, and BMC diagnostics.
    • Fabric Tuning: Debug and tune NVIDIA Quantum InfiniBand fabrics (NDR and HDR), including Subnet Manager, topology, adaptive routing, SHARP, and congestion issues.
    • Root Cause Analysis: Drive deep-dive RCA on GPU failures, XID errors, ECC events, thermal throttling, and link flaps.
  • Automation & Observability
    • Systems Automation: Write robust automation in Python, Go, or Bash to replace manual tasks, improve MTTR, and scale operations efficiently.
    • Observability Stack: Build and maintain observability for GPU fleets using Prometheus, Grafana, DCGM, node exporter, and custom exporters.
    • Capacity & Rollouts: Contribute to capacity planning, firmware rollout strategy, and cluster bring-up for new sites.
  • Collaboration & Incident Response
    • Workload Optimization: Partner with customer workload teams on NCCL tuning, job scheduling policy, QoS, and fairshare.
    • Operational Excellence: Lead post-mortems, write comprehensive runbooks, and improve change management processes across global regions.
    • On-Call Leadership: Participate in the on-call rotation for US hours and handle escalations from international sites when necessary.

Qualifications

  • 5+ years in SRE, systems engineering, or HPC operations.
  • Extensive production experience with SLURM at scale (accounting/slurmdbd, prolog/epilog scripts, cgroups, GRES, topology awareness).
  • Hands-on experience with NVIDIA datacenter GPUs, driver stacks, CUDA runtime, Fabric Manager, nvidia-smi, DCGM, and GPUDirect RDMA.
  • Operational experience with InfiniBand fabrics at 100G or higher (OpenSM/UFM, ibdiagnet, perfquery, and fabric troubleshooting).
  • Expert-level Linux admin skills (Ubuntu/RHEL family), including kernel tuning, systemd, networking, and PXE provisioning.
  • Solid scripting skills in Python and Bash, plus working knowledge of Ansible or Terraform.

Nice to Have

  • Experience with NCCL internals, PyTorch distributed, or Megatron-style training stacks.
  • Familiarity with BCM (Base Command Manager), Run:ai, or similar managers.
  • Experience running Kubernetes on bare metal with GPU, Network, and MPI Operators.
  • Exposure to high-performance storage like Lustre, WEKA, VAST, or BeeGFS.
  • Prior work in an AI cloud, neocloud, HPC center, or hyperscaler environment.

Benefits

  • You will touch clusters that train world-class models, working with the most advanced hardware available.
  • We maintain a flat structure with direct access to leadership and a culture built around technical craftsmanship and ownership.
  • Full remote flexibility with occasional travel for team summits and datacenter site visits.
  • Comprehensive US benefits including performance bonuses, equity participation, and 401(k) eligibility.