Lead Site Reliability Engineer @Hitachi Solutions

Apr 09, 2025 - Hitachi Solutions is hiring a remote Lead Site Reliability Engineer. 💸 Salary: $142,500 - $198,750 USD. 📍 Location: USA.

This description is a summary of our understanding of the job description. Click the 'Apply' button to find out more.

Role Description

This is a full-time role in our product organization for an expert in systems design with considerable skill and expertise in large-scale software development in an Azure dev environment.

  • Designs and implements Continuous Integration/Continuous Deployment (CI/CD) tooling using GitHub Actions, Azure DevOps, and related technologies.
  • Defines and implements build and test pipelines for containerized architectures.
  • Implements infrastructure as code (IaC) for the stateful deployment of environments.
  • Manages Role-Based Access Control (RBAC), linting and other code quality controls, GitOps and Kubernetes pipelines, and SaaS deployment APIs.
  • Assists in the design, engineering, development, planning and administration of Azure Kubernetes Service (AKS) clusters for critical business applications.
  • Works closely with application, engineering, security and operations teams to engineer and build Kubernetes and Azure PaaS & IaaS solutions within an agile and modern enterprise-grade operating model.
  • Demonstrates capability to learn new concepts quickly, and/or has robust domain expertise.

Qualifications

  • Responsible for availability, latency, performance, efficiency, monitoring/observability, emergency response, capacity planning, setting and maintaining SLOs, SLIs, and error budgets, and creating dashboards.
  • Analyze, troubleshoot, and resolve operational challenges that affect defined SLOs.
  • Manage site stability, performance, reliability, and maintain uptime for production environments.
  • Develop a fully automated multi-environment observability stack based on the existing system and extend it to predict capacity needs based on the usage patterns.
  • Strive for automation to reduce toil and increase development velocity.
  • Perform application-specific production support, incident management, change management, problem management, RCAs, and service restoration as needed.
  • Identify changes to the product architecture from a reliability, performance and availability perspective, using a data-driven approach.
  • Analyze and address complex technical challenges and issues that arise during the software development & run lifecycle.
  • Create and maintain technical documentation, including design specifications, user guides, run books and best practice guidelines.
  • Actively look for opportunities to improve the availability and performance of the system by applying the learnings from monitoring and observation.
  • Collaborate with software development teams on release management, help shape the future roadmap, and establish strong operational readiness across teams.
  • Participate in Agile ceremonies, such as sprint planning, stand-up meetings, and retrospectives.
  • Collaborate with product managers, designers, and other engineers to ensure alignment and efficient project execution.
  • Share your expertise and mentor engineers, helping them grow and develop their skills.
  • Stay up to date with the latest technologies, tools, and developments in cloud computing.
  • Collaborate with customers to understand their needs, gather feedback, and provide technical support and guidance as needed.
  • Triage incoming Web Support escalation requests, routing them to the applicable internal teams.
  • Contribute to incident root cause analysis, service restoration, and serve as an incident commander during outage events.
  • Strong background as an SRE supporting a 24x7 highly available production environment for a SaaS or cloud service provider.
  • Solid experience with Monitoring/APM/Observability tools (Datadog, Application Insights, Prometheus, Grafana, etc.).
  • Strong background with Azure Resources like Key Vault, Data Factory, Azure Databricks and Storage Accounts.
  • Experience implementing observability plans around logs, metrics, and traces.
  • Experience in an agile development team developing software.
  • Implement and follow best practices for CI/CD.
  • Experience with cloud infrastructure environments, preferably Azure, and infrastructure as code (Terraform, Bicep, ARM).
  • Design, develop, and maintain infrastructure using popular IaC tools and technologies such as Terraform, Helm, and others.
  • Strong experience with containerization technology and/or Kubernetes.
  • Experience with release automation, system administration, and configuration management.
  • Experience with programming languages (Python, Go, etc.).
  • Strong understanding of Linux, Windows, software development, systems, networking, and cloud concepts.
  • Strong interpersonal and teamwork skills, including the ability to set and enforce process and to influence engineers who are not direct reports.
  • Strong analytical and programming skills (Python, Go, etc.).
  • (Bonus) Experience with MLflow and other MLOps pipeline technology.

Requirements

  • Continuous Integration/Continuous Deployment (CI/CD)
  • Instrumentation strategy and Site Reliability Engineering (SRE)
  • Release Communication and Collaboration
  • Security and Compliance
  • Test-Driven Development (TDD), especially with respect to CI/CD and DevOps

Benefits

  • Base Salary Pay Range: $142,500 - $198,750 USD
  • Bonus Plan
  • Medical, Dental and Vision Coverage
  • Life Insurance and Disability Programs
  • Retirement Savings with Company Match
  • Paid Time Off
  • Flexible Work Arrangements including Remote Work
