Role Description
We are looking for a Lead Site Reliability Engineer to join our Cloud Engineering division. Cloud Engineering ensures the continuous availability of the technologies and systems that are the foundation of athenahealth's services.
- Define, measure, and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for cloud services and infrastructure components.
- Lead efforts to continuously improve system availability, fault tolerance, and disaster recovery capabilities.
- Ensure proactive incident detection, efficient root cause analysis, and timely resolution of production incidents.
- Participate in a 12x7 on-call rotation.
- Drive automation efforts to reduce manual intervention and streamline cloud infrastructure management.
- Implement Infrastructure as Code (IaC) using tools like Terraform, AWS CloudFormation, and Ansible.
- Automate deployment, scaling, and monitoring processes to improve efficiency and reduce operational complexity.
- Design and implement monitoring, logging, and alerting solutions to track cloud infrastructure health, performance, and security.
- Use observability tools (e.g., Prometheus, Grafana, CloudWatch) to ensure continuous visibility into cloud infrastructure performance and capacity.
- Identify bottlenecks and performance issues, and propose and implement improvements to ensure optimal resource usage.
- Ensure that cloud infrastructure is built with security best practices in mind and meets all relevant compliance and regulatory requirements.
- Collaborate with security teams to implement security controls and risk mitigation strategies across cloud environments.
- Regularly audit and review cloud infrastructure for security vulnerabilities and compliance gaps.
- Work closely with development, DevOps, and operations teams to ensure cloud infrastructure aligns with application and business requirements.
- Lead and mentor a team of Site Reliability Engineers, promoting best practices and fostering a culture of operational excellence.
- Act as a key technical point of contact for cloud-related infrastructure and operations issues.
- Lead incident response efforts for cloud infrastructure issues.
- Conduct post-incident reviews (PIRs) to identify root causes and implement preventive measures.
- Continuously refine incident management processes to reduce downtime and shorten recovery times.
Qualifications
- 10 years of hands-on experience with cloud automation and configuration management tools (e.g., Terraform, AWS CloudFormation, Ansible, Puppet).
- 7+ years of experience in a Site Reliability Engineering (SRE), Infrastructure Engineering, or DevOps role, with at least 3 years in a technical leadership capacity.
- Deep knowledge of cloud services and technologies (e.g., EC2, S3, Lambda, Kubernetes).
- Proficiency in scripting or programming languages (e.g., Python, Go, Bash).
- Experience with monitoring, logging, and observability tools (e.g., Prometheus, Grafana, Datadog, ELK stack).
- Familiarity with Continuous Integration/Continuous Deployment (CI/CD) pipelines and cloud-native development practices.
- Strong expertise in managing cloud infrastructure (AWS, Google Cloud, Azure) in production environments.
- Experience with cloud-native architectures, microservices, and containerized environments (Kubernetes, Docker).
- Proven experience in building and managing highly available, scalable, and fault-tolerant systems in the cloud.
- Strong understanding of cloud networking, storage, compute services, on-prem infrastructure, and security best practices.
- Strong knowledge of Linux administration and internals.
- Effective communication skills, with the ability to translate technical concepts for non-technical stakeholders.
Preferred Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field.
- Knowledge of database systems such as MySQL, Oracle, or PostgreSQL.
- Experience with managing on-prem infrastructure at scale.
- Certifications in AWS, Red Hat, or relevant technologies are a plus.
- Experience running containerized workloads (Kubernetes, Docker) in production.
Expected Compensation
$119,000 - $203,000. The base salary range shown reflects the full range for this role from minimum to maximum. At athenahealth, base pay depends on multiple factors, including job-related experience, relevant knowledge and skills, how your qualifications compare to others in similar roles, and geographical market rates.
Benefits
- Health and financial benefits.
- Perks specific to each location, including commuter support, employee assistance programs, tuition assistance, and employee resource groups.
- Flexible work-life balance with options for remote work.
- Events throughout the year, including book clubs, external speakers, and hackathons.