Senior Staff Site Reliability Engineer @Rocket.Chat

DevOps / Sysadmin

Salary: unspecified
Remote Location: 🇺🇸 USA Only
Job Type: full-time
Posted: 2wks ago

Role Description

As a Senior Staff Site Reliability Engineer, you will be responsible for the overall reliability, scalability, and operational excellence of Rocket.Chat’s infrastructure and deployment systems. You will lead the infrastructure strategy, guide the platform roadmap, and ensure that all systems run reliably and efficiently across our global deployments. This includes overseeing:

Cloud infrastructure
Kubernetes platforms
Deployment automation
Monitoring systems
Operational processes

You will work closely with Engineering, Security, Product, and Leadership teams to ensure that infrastructure capabilities support product growth, customer demands, and operational resilience. Your leadership will ensure that reliability engineering, automation, and operational best practices are embedded into the development lifecycle across the company.

Qualifications

Strong background in software engineering and infrastructure architecture with experience designing and operating large-scale distributed systems.
Expert understanding of microservices, event-driven architectures, stateful vs. stateless scaling constraints, and data consistency models.
Advanced coding proficiency (Go preferred, Python acceptable), with the ability to build complex core frameworks and contribute to the core Rocket.Chat codebase when necessary.
Deep expertise with Kubernetes and cloud infrastructure platforms (e.g., AWS, GCP, Azure, OVH) in production environments.
Extensive experience with Infrastructure as Code (IaC) tools such as Terraform, Pulumi, or Ansible.
Strong experience designing and managing CI/CD and GitOps deployment systems using tools like ArgoCD.
Hands-on experience with observability platforms including monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, Loki).
Strong understanding of networking fundamentals (TCP/IP, DNS, routing), security best practices, and cloud architecture principles.
Experience leading infrastructure, SRE, or platform engineering teams responsible for production systems.
Strong knowledge of containerized systems and deployment architectures supporting high availability and scalability.
Familiarity with database technologies such as MongoDB or Redis and their operational considerations at scale.
Experience supporting SaaS platforms with large-scale customer deployments.
Experience managing multi-cluster Kubernetes environments and multi-region architectures.
Ability to define and execute a long-term infrastructure vision aligned with company growth.

Requirements

Experience with open source software.
Active U.S. Security Clearance (or eligibility to obtain one) is a strong plus.

Soft Skills

Passion: Genuine enthusiasm for what you do and how it contributes to our company's mission.
Dream: Proactively seek out opportunities and challenges to achieve extraordinary results.
Own: Take ownership of your work, set high standards for yourself, and be accountable for outcomes.
Trust: Recognize the importance of trust and support, and actively work toward a collaborative and inclusive workplace.
Share: Communicate openly and transparently, ensuring clarity and honesty in interactions.

What You'll Do

Influence core product architecture (Core, Fleetcommand, Omnichannel, etc.) before code is written to ensure reliability, scalability, and operability are baked in by design.
Lead the engineering of systemic solutions that eliminate entire classes of failures, moving the organization from reactive firefighting to proactive prevention.
Act as the technical visionary for our deployment products (LaunchControl, Airlock, Launchpad), defining the long-term technical roadmap and architectural standards alongside the Head of Infrastructure.
Design, prototype, and build foundational tooling, core libraries, and frameworks (in Go, Python, etc.) that make it easier for both SREs and Product Engineers to deploy safely, monitor accurately, and operate efficiently.
Champion and evolve the Infrastructure as Code (IaC) paradigms (Pulumi, Terraform) to ensure they meet the needs of increasingly complex, multi-region, and air-gapped enterprise deployments.
Serve as the highest level of technical escalation for catastrophic, multi-domain Sev-1 incidents that standard operational protocols cannot resolve.
Drive the strategic direction of incident management, ensuring that post-mortems result in structural, org-wide improvements rather than localized band-aids.
Evolve the company's disaster recovery (DR) and chaos engineering programs to simulate and defend against complex cascading failures.
Define, document, and enforce global standards for observability (SLIs, SLOs, error budgets), alerting, and production readiness across all engineering squads.
Author foundational Architectural Decision Records (ADRs) and Requests for Discussion (RFDs) that guide the technical direction of the company.
Act as a role model and technical mentor for Senior and Mid-Level SREs, as well as Senior Product Engineers, elevating the overall technical culture of Rocket.Chat.
Facilitate org-wide technical enablement sessions, knowledge sharing, and blameless culture advocacy.
Partner with Engineering leadership, Product, Security, and Customer Success to align infrastructure strategy with business and customer needs.
Represent Rocket.Chat’s technical vision through technical writing, conference talks, and community engagement within the infrastructure and open-source ecosystem.
Foster a culture of ownership, operational excellence, and continuous improvement across the infrastructure organization.

Benefits

Fully Remote & Flexible Working Hours
Flexible Paid Time Off, Holidays and Vacation
Company Laptop
Remote Benefit: iTalki, Courses and Books
Stock Options
Multicultural Environment
Vibrant Company Culture

Check out our handbook to dive into each of our awesome benefits!

Similar Remote Jobs

Senior DevOps Engineer • Marketerx Marketerx

DevOps / Sysadmin $130k - $150k USA Only

2wks ago
Apply See more >

Kickstart Your Job Search

⚡ 12,726 remote jobs added this week

You're seeing 0.4% of available roles

Unlock 152,720 jobs →

Meet JobCopilot: Your Personal Al Job Hunter

Automatically Apply to Remote Jobs

Try it now →
