Senior Site Reliability Engineer — AI Studio (Inference Platform) @Nebius
DevOps / Sysadmin
Salary: competitive salary
Remote location: Europe, USA
Job type: unspecified
Posted: 1wk ago

[Hiring] Senior Site Reliability Engineer — AI Studio (Inference Platform) @Nebius

1wk ago - Nebius is hiring a remote Senior Site Reliability Engineer — AI Studio (Inference Platform). 💸 Salary: competitive salary and comprehensive benefits package. 📍Location: Europe, USA

This description summarizes our understanding of the original job posting.

Role Description

In this role you will own the reliability, performance, and observability of the entire inference stack.

  • Design and refine telemetry pipelines — metrics, logs, and traces that turn hundreds of terabytes of signal into clear, actionable insight.
  • Tune Kubernetes autoscalers to squeeze more efficiency out of GPUs.
  • Craft Terraform modules that bake resilience into every new cluster.
  • Harden request-routing and retry logic so even transient failures go unnoticed by users.
  • Rely on automation and runbooks to detect, isolate, and remediate problems in minutes.
  • Drive the post-mortem culture that prevents recurrence.
  • Scale the platform smoothly while hitting aggressive cost and reliability targets.

Qualifications

  • Deep fluency with Kubernetes, Prometheus, Grafana, Terraform, and infrastructure-as-code.
  • Comfortable scripting in Python or Bash.
  • Understanding of alert design and SLOs for high-throughput APIs.
  • Experience in production environments and knowledge of how distributed back-ends fail.
  • Experience shepherding GPU-heavy workloads — whether with vLLM, Triton, Ray, or another accelerator stack.
  • Background in MLOps or model-hosting platforms.
  • Passion for building self-healing systems and debugging performance from kernel to application layer.
  • Enjoyment of collaborating with software engineers to turn reliability into a feature.

Requirements

  • Experience with high-throughput APIs.
  • Ability to work under extreme load and recover gracefully from unexpected issues.

Benefits

  • Competitive salary and comprehensive benefits package.
  • Opportunities for professional growth within Nebius.
  • Flexible working arrangements.
  • A dynamic and collaborative work environment that values initiative and innovation.

Before You Apply

Location restriction for this remote position: Europe, USA.
Beware of scams! When applying for jobs, you should NEVER have to pay anything.