Site Reliability Engineer @Runpod, Inc.
DevOps / Sysadmin
Salary usd 150,000 - 2..
Remote Location
🇺🇸 USA Only
Job Type full-time
Posted 3d ago

[Hiring] Site Reliability Engineer @Runpod, Inc.

3d ago - Runpod, Inc. is hiring a remote Site Reliability Engineer. 💸 Salary: USD 150,000 - 200,000 per year 📍 Location: USA

Role Description

As a Site Reliability Engineer on the Reliability team, you will focus on ensuring the stability and resilience of Runpod's distributed platform. You will partner with engineering teams to improve system design, strengthen observability, and prevent incidents before they happen.

This role blends software engineering with production operations. You'll work on reliability frameworks, SLO design, automation, and production hardening, reducing errors and improving performance across different services and infrastructure.

This is a high-impact role central to maintaining trust with developers running critical AI workloads on Runpod.

Your Impact

  • Increase platform uptime and reduce incident frequency and duration
  • Establish and operationalize SLIs/SLOs across services
  • Improve MTTR through better tooling, automation, and runbooks
  • Strengthen production readiness standards
  • Drive long-term systemic reliability improvements

You will influence how reliability is defined and measured across Runpod and help build the operational backbone of the company.

Responsibilities

  • Reliability Engineering
    • Define and implement SLIs/SLOs for critical services
    • Lead incident response and coordinate cross-team mitigation efforts
    • Conduct blameless postmortems and ensure corrective actions are completed
    • Perform production readiness reviews for new services and features
    • Identify systemic risks and drive preventative improvements
  • Observability & Monitoring
    • Design and improve monitoring, alerting, and dashboards (Prometheus, Grafana, etc.)
    • Improve signal-to-noise ratio in alerts and reduce alert fatigue
    • Build internal tooling for reliability tracking and reporting
    • Improve visibility into GPU performance and distributed systems health
  • Automation & Toil Reduction
    • Automate recurring operational workflows
    • Build tools and scripts (Python, Go, Bash) to eliminate manual processes
    • Improve deployment safety through automation and guardrails
    • Strengthen CI/CD reliability and release processes
  • Cross-Functional Reliability Advocacy
    • Partner with engineering teams to improve system resilience
    • Provide guidance on fault tolerance, scalability, and failure handling
    • Contribute to architectural discussions with a reliability-first mindset

Qualifications

  • 5+ years of experience in SRE, Reliability Engineering, or Production Engineering
  • Strong Linux systems and networking expertise
  • Experience managing containerized production systems
  • Strong understanding of distributed systems and failure modes
  • Experience defining and managing SLIs/SLOs
  • Proven incident response and postmortem leadership experience
  • Strong scripting or programming skills
  • Experience with monitoring and alerting systems
  • Excellent written communication skills
  • Successful completion of a background check

Requirements

  • Experience with GPU infrastructure or AI/ML platforms
  • Experience improving reliability in high-growth or large-scale environments
  • Familiarity with GPU observability tooling
  • Experience with Infrastructure as Code
  • Experience working in startup environments
  • Experience building internal reliability platforms or frameworks

Benefits

  • The competitive base pay for this position ranges from $150,000 to $200,000 USD.
  • Meaningful equity in a fast-growing company: everyone on the team receives stock options.
  • Generous medical, dental & vision plans.
  • Flexible PTO: take the time you need to recharge.
  • Most roles are remote-first, with inclusive, collaborative teams that use Slack as the main form of internal communication.
  • Join a passionate team on the cutting edge of AI infrastructure.