Engineer Lead, Site Reliability @Zensar
DevOps / Sysadmin
Salary unspecified
Remote Location
Job Type full-time
Posted 2d ago

2d ago - Zensar is hiring a remote Engineer Lead, Site Reliability. 💸 Salary: unspecified 📍Location: India

Role Description

The Engineer Lead – Site Reliability Engineering (SRE) will provide technical and thought leadership to ensure the reliability, resiliency, scalability, and observability of mission‑critical platforms supporting Banking Solutions, Payments, and Capital Markets. This role blends advanced SRE practices, resiliency and chaos engineering, and service health governance with people and technical leadership. The Engineer Lead will define reliability standards, mentor teams, and partner closely with Engineering, DevOps, Security, and Product stakeholders to embed reliability as a core product feature.

What You Will Be Doing

  • SRE Leadership & Strategy
    • Act as the technical lead and reliability champion, driving SRE best practices across multiple teams and platforms.
    • Define and evangelize reliability standards, principles, and operational excellence frameworks.
    • Guide teams in balancing feature velocity with system reliability using error budgets and SLO-driven decision-making.
    • Mentor and coach engineers in SRE, observability, incident management, and automation.
  • Service Health, SLI/SLO & Observability
    • Define, implement, and govern Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets.
    • Establish standardized service health monitoring and reporting frameworks across platforms.
    • Design and maintain end-to-end observability solutions covering infrastructure, applications, APIs, and customer experience.
    • Drive reliability insights through dashboards, health scores, and executive-level metrics.
  • Resiliency Testing & Chaos Engineering
    • Lead resiliency engineering initiatives to validate system behavior under failure conditions.
    • Design and execute chaos engineering experiments to proactively identify weaknesses in architecture and operations.
    • Integrate resiliency testing into CI/CD pipelines and pre-production environments.
    • Partner with development and architecture teams to ensure systems are fault-tolerant, self-healing, and resilient by design.
  • Incident Management & Operational Excellence
    • Lead high-severity incident response efforts, providing clear technical and operational direction.
    • Establish and continuously improve incident management, escalation, and communication practices.
    • Drive blameless post-incident reviews, ensuring root causes are addressed and preventive actions are implemented.
    • Measure and improve operational KPIs such as MTTR, MTTD, and incident recurrence rates.
  • Automation, Platform Reliability & Cloud Operations
    • Champion automation-first approaches to reduce toil and manual intervention.
    • Oversee deployment pipelines, configuration management, and release reliability practices.
    • Guide teams on Infrastructure as Code (IaC), environment consistency, and cloud governance.
    • Ensure disaster recovery, backup, and failover strategies are tested and production-ready.
  • Cross-Functional Collaboration & Governance
    • Collaborate with Engineering, QA, DevOps, Security, Architecture, and Product teams to embed reliability into the SDLC.
    • Ensure platforms comply with security, regulatory, and audit requirements, especially in financial services environments.
    • Influence technical roadmaps to prioritize resiliency, stability, and customer experience.

Qualifications

  • Strong experience with core SRE practices, including system reliability, incident management, automation, and observability.
  • Hands-on expertise in resiliency testing and chaos engineering methodologies.
  • Proven experience designing and operating SLI/SLO and error budget frameworks at scale.
  • Deep understanding of distributed systems, microservices architectures, and cloud-native platforms.
  • Experience with cloud platforms (AWS, Azure, and/or Google Cloud).
  • Hands-on experience with Docker and Kubernetes.
  • Expertise in monitoring, observability, and logging tools, such as:
    • Prometheus, Grafana, Datadog
    • Splunk, ELK Stack
  • Strong background in incident management, post-mortem facilitation, and production support.
  • Proficiency in automation and scripting (Python, Bash, Terraform, Ansible).
  • Experience managing and improving CI/CD pipelines (Jenkins, GitLab CI/CD, Azure DevOps).
  • Ability to lead technical discussions, influence decisions, and communicate effectively with senior stakeholders.
  • Strong ownership mindset with accountability for service reliability and customer outcomes.

Nice to Have (SRE+ Skills)

  • Experience with Harness Chaos Engineering (CE) or similar chaos engineering platforms.
  • Programming experience in Java, particularly for debugging, performance analysis, or building internal SRE tooling.
  • Experience implementing self-healing and auto-remediation workflows.
  • Exposure to banking, payments, or capital markets domains.
  • Familiarity with chaos engineering maturity models and reliability governance practices.

Benefits

  • We believe the best work happens when individuality is celebrated, growth is encouraged, and well-being is prioritized.
  • We are an equal employment opportunity (EEO) and affirmative action employer, committed to creating an inclusive workplace.

Before You Apply

Be aware of the location restriction for this remote position: India
Beware of scams! When applying for jobs, you should NEVER have to pay anything.