Site Reliability Engineer @Runpod, Inc.

DevOps / Sysadmin

Salary usd 150,000 - 2..	Remote Location 🇺🇸 USA Only
Job Type full-time	Posted 3d ago

3d ago - Runpod, Inc. is hiring a remote Site Reliability Engineer. 💸 Salary: USD 150,000 - 200,000 per year 📍 Location: USA

Role Description

As a Site Reliability Engineer on the Reliability team, you will focus on ensuring the stability and resilience of Runpod’s distributed platform. You will partner with engineering teams to improve system design, strengthen observability, and prevent incidents before they happen.

This role blends software engineering with production operations. You’ll work on reliability frameworks, SLO design, automation, and production hardening, reducing errors and improving performance across different services and infrastructure.

This is a high-impact role central to maintaining trust with developers running critical AI workloads on Runpod.

Your Impact

Increase platform uptime and reduce incident frequency and duration
Establish and operationalize SLIs/SLOs across services
Improve MTTR through better tooling, automation, and runbooks
Strengthen production readiness standards
Drive long-term systemic reliability improvements

You will influence how reliability is defined and measured across Runpod and help build the operational backbone of the company.

Responsibilities

Reliability Engineering
- Define and implement SLIs/SLOs for critical services
- Lead incident response and coordinate cross-team mitigation efforts
- Conduct blameless postmortems and ensure corrective actions are completed
- Perform production readiness reviews for new services and features
- Identify systemic risks and drive preventative improvements

Observability & Monitoring
- Design and improve monitoring, alerting, and dashboards (Prometheus, Grafana, etc.)
- Improve signal-to-noise ratio in alerts and reduce alert fatigue
- Build internal tooling for reliability tracking and reporting
- Improve visibility into GPU performance and distributed systems health

Automation & Toil Reduction
- Automate recurring operational workflows
- Build tools and scripts (Python, Go, Bash) to eliminate manual processes
- Improve deployment safety through automation and guardrails
- Strengthen CI/CD reliability and release processes

Cross-Functional Reliability Advocacy
- Partner with engineering teams to improve system resilience
- Provide guidance on fault tolerance, scalability, and failure handling
- Contribute to architectural discussions with a reliability-first mindset

Qualifications

5+ years of experience in SRE, Reliability Engineering, or Production Engineering
Strong Linux systems and networking expertise
Experience managing containerized production systems
Strong understanding of distributed systems and failure modes
Experience defining and managing SLIs/SLOs
Proven incident response and postmortem leadership experience
Strong scripting or programming skills
Experience with monitoring and alerting systems
Excellent written communication skills
Successful completion of a background check

Requirements

Experience with GPU infrastructure or AI/ML platforms
Experience improving reliability in high-growth or large-scale environments
Familiarity with GPU observability tooling
Experience with Infrastructure as Code
Experience working in startup environments
Experience building internal reliability platforms or frameworks

Benefits

The competitive base pay for this position ranges from $150,000 to $200,000 USD.
Meaningful equity in a fast-growing company — everyone on the team receives stock options.
Generous medical, dental & vision plans.
Flexible PTO — take the time you need to recharge.
Most roles are remote-first, with inclusive, collaborative teams that use Slack as the main form of internal communication.
Join a passionate team on the cutting edge of AI infrastructure.
