Senior Staff Site Reliability Engineer @Rocket.Chat
DevOps / Sysadmin
Salary unspecified
Remote Location
🇺🇸 USA Only
Job Type full-time
Posted 2wks ago

[Hiring] Senior Staff Site Reliability Engineer @Rocket.Chat

2wks ago - Rocket.Chat is hiring a remote Senior Staff Site Reliability Engineer. 💸 Salary: unspecified 📍 Location: USA

Role Description

As a Senior Staff Site Reliability Engineer, you will be responsible for the overall reliability, scalability, and operational excellence of Rocket.Chat's infrastructure and deployment systems. You will lead the infrastructure strategy, guide the platform roadmap, and ensure that all systems run reliably and efficiently across our global deployments. This includes overseeing:

  • Cloud infrastructure
  • Kubernetes platforms
  • Deployment automation
  • Monitoring systems
  • Operational processes

You will work closely with Engineering, Security, Product, and Leadership teams to ensure that infrastructure capabilities support product growth, customer demands, and operational resilience. Your leadership will ensure that reliability engineering, automation, and operational best practices are embedded into the development lifecycle across the company.

Qualifications

  • Strong background in software engineering and infrastructure architecture with experience designing and operating large-scale distributed systems.
  • Expert understanding of microservices, event-driven architectures, stateful vs. stateless scaling constraints, and data consistency models.
  • Advanced coding proficiency (Go preferred, Python acceptable) capable of building complex core frameworks and contributing to the core Rocket.Chat codebase when necessary.
  • Deep expertise with Kubernetes and cloud infrastructure platforms (e.g., AWS, GCP, Azure, OVH) in production environments.
  • Extensive experience with Infrastructure as Code (IaC) tools such as Terraform, Pulumi, or Ansible.
  • Strong experience designing and managing CI/CD and GitOps deployment systems using tools like ArgoCD.
  • Hands-on experience with observability platforms including monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, Loki).
  • Strong understanding of networking fundamentals (TCP/IP, DNS, routing), security best practices, and cloud architecture principles.
  • Experience leading infrastructure, SRE, or platform engineering teams responsible for production systems.
  • Strong knowledge of containerized systems and deployment architectures supporting high availability and scalability.
  • Familiarity with database technologies such as MongoDB or Redis and their operational considerations at scale.
  • Experience supporting SaaS platforms with large-scale customer deployments.
  • Experience managing multi-cluster Kubernetes environments and multi-region architectures.
  • Ability to define and execute a long-term infrastructure vision aligned with company growth.

Requirements

  • Experience with open source software.
  • Active U.S. Security Clearance (or eligibility to obtain one) is a strong plus.

Soft Skills

  • Passion: Genuine enthusiasm for what you do and how it contributes to our company's mission.
  • Dream: Proactively seek out opportunities and challenges to achieve extraordinary results.
  • Own: Take ownership of your work, set high standards for yourself, and be accountable for outcomes.
  • Trust: Recognize the importance of trust and support, and actively work towards a collaborative and inclusive workplace.
  • Share: Communicate openly and transparently, ensuring clarity and honesty in interactions.

What You'll Do

  • Influence core product architecture (Core, Fleetcommand, Omnichannel, etc.) before code is written to ensure reliability, scalability, and operability are baked in by design.
  • Lead the engineering of systemic solutions that eliminate entire classes of failures, moving the organization from reactive firefighting to proactive prevention.
  • Act as the technical visionary for our deployment products (LaunchControl, Airlock, Launchpad), defining the long-term technical roadmap and architectural standards alongside the Head of Infrastructure.
  • Design, prototype, and build foundational tooling, core libraries, and frameworks (in Go, Python, etc.) that make it easier for both SREs and Product Engineers to deploy safely, monitor accurately, and operate efficiently.
  • Champion and evolve the Infrastructure as Code (IaC) paradigms (Pulumi, Terraform) to ensure they meet the needs of increasingly complex, multi-region, and air-gapped enterprise deployments.
  • Serve as the highest level of technical escalation for catastrophic, multi-domain Sev-1 incidents that baffle standard operational protocols.
  • Drive the strategic direction of incident management, ensuring that post-mortems result in structural, org-wide improvements rather than localized band-aids.
  • Evolve the company's disaster recovery (DR) and chaos engineering programs to simulate and defend against complex cascading failures.
  • Define, document, and enforce global standards for observability (SLIs, SLOs, error budgets), alerting, and production readiness across all engineering squads.
  • Author foundational Architectural Decision Records (ADRs) and Requests for Discussion (RFDs) that guide the technical direction of the company.
  • Act as a role model and technical mentor for Senior and Mid-Level SREs, as well as Senior Product Engineers, elevating the overall technical culture of Rocket.Chat.
  • Facilitate org-wide technical enablement sessions, knowledge sharing, and blameless culture advocacy.
  • Partner with Engineering leadership, Product, Security, and Customer Success to align infrastructure strategy with business and customer needs.
  • Represent Rocket.Chat's technical vision through technical writing, conference talks, and community engagement within the infrastructure and open-source ecosystem.
  • Foster a culture of ownership, operational excellence, and continuous improvement across the infrastructure organization.

Benefits

  • Fully Remote & Flexible Working Hours
  • Flexible Paid Time Off, Holidays and Vacation
  • Company Laptop
  • Remote Benefit: iTalki, Courses and Books
  • Stock Options
  • Multicultural Environment
  • Vibrant Company Culture

Check out our handbook to dive into each of our awesome benefits!

Before You Apply

🇺🇸 Be aware of the location restriction for this remote position: USA Only