Role Description
We’re looking for a Senior Platform Engineer/SRE to own reliability, automation, and infrastructure-as-code for our modern Data & AI platform. In this role, you’ll ensure our Azure-based data ecosystem is reliable, scalable, and efficient. You’ll build Terraform-first infrastructure, improve developer experience, and support a healthcare environment where uptime and data reliability directly impact patient care.
Essential Duties and Responsibilities
- Infrastructure as Code
  - Build and maintain Terraform modules for data platform services (Snowflake, Airbyte, Astronomer, dbt, Kafka).
  - Develop IaC standards, GitOps workflows, and automated CI/CD pipelines using GitHub Actions.
  - Migrate manual configurations to fully codified infrastructure and enable self-service provisioning for engineers.
- Platform Reliability & Operations
  - Implement monitoring, alerting, and SLOs/SLIs for data pipelines and platform components.
  - Lead incident response, root cause analysis, and postmortems.
  - Create automation, runbooks, and self-healing capabilities to reduce MTTR.
- Cross-Cloud Architecture
  - Design secure connectivity patterns between Azure and AWS vendor systems.
  - Troubleshoot networking, VPN, private endpoint, DNS, and MFT integrations.
- Automation & Developer Experience
  - Build CI/CD pipelines using GitHub Actions for infrastructure changes with comprehensive testing (terraform plan, terraform validate, and compliance checks).
  - Implement policy-as-code using tools like Sentinel, OPA, or Azure Policy, integrated into GitHub workflows.
  - Develop testing frameworks for infrastructure code (Terratest, kitchen-terraform) with automated execution in GitHub Actions.
  - Improve abstractions and tooling to streamline development workflows.
- Performance & Cost Optimization
  - Optimize Snowflake compute usage and Airflow/dbt performance.
  - Apply cloud cost management practices and tagging strategies.
  - Support capacity planning and forecasting.
- Systems Troubleshooting & Problem Resolution
  - Lead complex troubleshooting efforts across distributed systems spanning multiple cloud providers.
  - Debug integration issues with Kafka streams, CDC patterns, and real-time data pipelines.
  - Resolve platform-wide incidents involving Snowflake, Astronomer, Airbyte, and downstream BI tools (Power BI, Tableau).
  - Partner with vendors on escalated support cases and coordinate resolution across multiple teams.
Qualifications
- 7+ years in Site Reliability Engineering, DevOps, or Platform Engineering roles.
- 5+ years owning Terraform at scale, including module design, lifecycle management, and supporting multiple teams through IaC abstractions.
- Deep, hands-on Azure expertise is required (designing and operating production systems, not just familiarity). AWS experience is a plus.
- Experience operating cloud-based data platforms (Snowflake, Airflow, etc.).
- Strong hands-on experience operating Kubernetes in production environments, including cluster management, networking, scaling, and reliability.
- Expert GitHub knowledge (pull requests, Actions, branching strategies).
- Strong troubleshooting skills across distributed systems, networking, and data pipelines.
- Proficient in Python, Bash, and PowerShell; able to read SQL and YAML/JSON.
Bonus Qualifications
- Healthcare data experience (FHIR, HL7, claims data).
- Experience with Kafka, dbt administration, and BI tools (Power BI/Tableau).
- Experience with data quality frameworks and synthetic data generation.
- Experience with policy-as-code tools (Sentinel, OPA, Checkov).
Benefits
- Base salary range: $155,000 - $175,000.
- Eligible for a performance bonus and benefits (subject to eligibility requirements).
- Flexible Vacation Policy for rest, relaxation, and personal time.
- 80 hours of Paid Sick, Safe, and Caregiver Leave annually.