Principal Site Reliability Engineer @Arcadia
DevOps / Sysadmin
Salary $180,000 - $230..
Remote Location
🇺🇸 USA Only
Job Type full-time
Posted 1wk ago

1wk ago - Arcadia is hiring a remote Principal Site Reliability Engineer. 💸 Salary: $180,000 - $230,000 a year 📍 Location: USA

Role Description

Arcadia’s customers rely on us to securely process and deliver high-value healthcare insights. Reliability, availability, performance, and security are foundational to trust—especially when systems support critical workflows and handle PHI. As a Principal Site Reliability Engineer, you’ll set reliability strategy across teams, drive cross-cutting platform improvements, and ensure we can scale delivery without scaling operational burden.

What Success Looks Like

  • In 3 months:
    • Build deep context on Arcadia’s platform, production risks, and operational practices.
    • Participate in on-call/incident response and quickly improve signal quality for at least one critical domain (dashboards, alerts, traces, runbooks).
    • Identify a high-leverage reliability initiative and align stakeholders on scope, success metrics, and milestones.
  • In 6 months:
    • Establish SLOs/error budgets for key customer journeys.
    • Drive operational readiness standards for launches.
    • Lead remediation for recurring incidents with measurable reductions in customer impact and MTTR.
    • Deliver major toil-reduction improvements via automation and self-service workflows.
  • In 12 months:
    • Own and execute a reliability program with cross-org impact.
    • Influence architecture decisions and establish org-wide operational standards.
    • Mentor Staff engineers, raising the reliability and security bar across Arcadia.

What You'll Be Doing

  • Act as the technical leader for reliability for one or more domains; set direction and standards while remaining hands-on where it matters most.
  • Drive reliability strategy across critical services: define SLOs/SLIs, error budgets, and reliability KPIs aligned to customer journeys and outcomes.
  • Own incident response maturity: lead complex incidents, improve incident command practices, and ensure high-quality RCAs with prioritized, tracked remediation.
  • Architect and implement automation to reduce toil and risk: runbook automation, self-service tools, and safe operational workflows (Python + Argo Workflows).
  • Advance GitOps delivery practices using Argo CD: promotion strategies, progressive delivery/canaries, and guardrails that reduce deploy risk.
  • Scale infrastructure management with Crossplane and Terraform: reusable patterns, policy controls, and paved roads for teams.
  • Lead operational readiness and reliability reviews for new features/architectural changes; reinforce non-functional requirements (availability, latency, security, cost).
  • Improve performance and cost efficiency through capacity planning, load testing, right-sizing, and architecture recommendations across AWS services.
  • Champion infrastructure security best practices for environments that handle PHI (least privilege, secrets management, auditability, and defense-in-depth).
  • Mentor Staff and Senior engineers through design reviews, code reviews, pairing, and documentation; raise reliability standards across teams.

Qualifications

  • 8+ years of experience in SRE, platform engineering, systems engineering, or related roles operating production services at scale.
  • Demonstrated principal-level impact: leading cross-team initiatives, influencing architecture decisions, and driving sustained improvements in reliability and operations.
  • Expertise in Kubernetes operations and troubleshooting, including safe rollout/rollback patterns, workload debugging, and operational guardrails.
  • Strong GitOps experience with Argo CD; experience building delivery workflows and automation using Argo Workflows.
  • Strong infrastructure orchestration and provisioning experience with Crossplane and Terraform; ability to define reusable platform patterns and controls.
  • Deep AWS experience (IAM, networking/VPC, compute, storage, managed services, observability) and strong understanding of reliability and failure modes in cloud systems.
  • Proficiency in Python for building automation, tooling, and reliability improvements.
  • Strong incident management and on-call leadership experience, including measurable improvements (availability, MTTR, alert quality, cost, or operational maturity).
  • Excellent communication skills: can translate technical risk and reliability tradeoffs to engineering leadership, product, and stakeholders; produces high-quality docs/runbooks.

Would Love For You To Have

  • Experience with ScyllaDB or similar distributed databases (e.g., Cassandra) and their reliability/performance characteristics.
  • Experience with Spark or data processing platforms, including reliability and cost considerations for large-scale workloads.
  • Familiarity with agentic coding practices and principles (safe automation, reviewable changes, guardrail-first workflows).
  • Strong infrastructure security knowledge: threat modeling for cloud/Kubernetes, RBAC/IAM design, secrets management, supply chain security, and security observability.

Principal Engineer Competencies

  • Customer Focus: champions customer impact; drives SLO definition with product partners; participates in incidents to limit customer impact; may engage customers to understand problems.
  • Technical Leadership: acts as the lead technical representative across teams; negotiates interfaces; anticipates edge cases; designs telemetry for availability and reliability.
  • Total Ownership: owns outcomes from requirements and design through production support; transitions complex changes with multi-phase rollouts and long-term ownership.
  • Effective Communication: communicates to diverse audiences; finalizes key documentation (runbooks, guides, FAQs); synthesizes standards and best practices.
  • Proactive Leadership: coaches senior/peer teams primarily through review; delegates appropriately; sets clear expectations (Definition of Done) and improves service processes/rotations.

Benefits

  • Be part of a mission-driven company that is transforming the healthcare industry by changing the way patients receive care.
  • A flexible, remote-friendly company with personality and heart.
  • Employee-driven programs and initiatives for personal and professional development.
  • Become a member of the talented, energized, diverse and purpose-driven Arcadian Community.
  • $180,000 - $230,000 a year.