Service Engineer @Microsoft
DevOps / Sysadmin
Salary unspecified
Remote Location
remote India
Job Type full-time
Posted YDay

[Hiring] Service Engineer @Microsoft

YDay - Microsoft is hiring a remote Service Engineer. 💸 Salary: unspecified 📍Location: India

You will also lead the evolution of Azure's Incident Management practice through Post-Incident Reviews, process development, and system automation. By leveraging telemetry and metrics, you will identify and drive platform-wide improvements with global impact. You'll be the single point of command and control during high-severity incidents, orchestrating cross-functional engineering, operations, and communications to minimize impact, restore services quickly, and protect the trust of our global customer base. This role offers a unique opportunity to make an immediate impact and improve systems at scale.

Qualifications
  • 5+ years' proven expertise in mission-critical cloud operations, high-severity incident response, SRE, or large-scale systems engineering on hyperscale platforms like Azure, AWS, or GCP.
  • Must have Service Engineering experience in a 24 x 7 x 365 enterprise environment.
  • Deep understanding of cloud architecture patterns, microservices, and containerization.
  • Demonstrated ability to make decisions quickly, under pressure, and with limited data, without compromising long-term reliability.
  • Familiarity with monitoring and observability tools (e.g., Grafana, Prometheus, Datadog, Splunk, New Relic).
  • Contribute to implementing observability frameworks to proactively detect performance bottlenecks.
  • Strong knowledge of CI/CD pipelines, container orchestration (Kubernetes, Docker), and infrastructure as code (Terraform, ARM, Bicep).
  • Familiarity with AI/ML frameworks and cloud AI services.
  • Experience implementing AI-driven monitoring, alerting, and remediation systems.
  • Fluency in one or more automation languages (PowerShell, Python, CLI, etc.).
  • Understanding of ITIL or other incident management frameworks is a must.
  • Understanding of High Availability, Disaster Recovery, Business Continuity, and Performance Tuning.
  • Demonstrated strategic thinking, quantitative and analytical skills, team leadership, and collaboration.
  • Excellent problem resolution, judgment, negotiating, and decision-making skills.
  • Desired: strong knowledge of Windows Platform or Linux, developer tools, and the ability to diagnose and debug user code.
  • Ability to effectively manage and prioritize multiple tasks in accordance with high-level objectives and projects.
  • Excellent communication skills (written and verbal) in English, especially in high-pressure scenarios.
  • Ability to communicate with a variety of audiences, including high-profile customers, executive management, and engineering teams.
  • Experience with Azure, AWS, or GCP core services and their interdependence.
  • Bachelor's or master's degree in Computer Science, Information Technology, or equivalent experience.
  • 8+ years of demonstrated experience as an Incident Commander or Crisis Manager for critical, high-severity incidents in high-availability, distributed environments.
  • Experience with SRE (Site Reliability Engineering) principles and practices.
  • Exposure to chaos engineering, fault injection, or high availability architecture.
  • AI/ML experience (beginner to intermediate): familiarity with how AI/ML models are integrated into cloud infrastructure and their potential failure modes; experience using AI-powered tools for incident analysis, log correlation, or predictive alerting; an understanding of the challenges and risks associated with AI/ML systems in a production environment.
  • Certifications: relevant cloud certifications (e.g., AWS Certified DevOps Engineer, Azure Solutions Architect, GCP Professional Cloud Architect); certifications in ITIL, SRE, or other relevant frameworks.

To be successful in this role, you must have a great track record of customer compassion, an engineering mindset, an innate aptitude for agility, and technical excellence in software engineering.
Responsibilities
  • Collaborate closely with Engineering/PM to ensure the availability and performance of Live Site and the satisfaction of our customers.
  • Manage high-severity incidents (SEV0/SEV1/SEV2) across Azure services, serving as the single point of accountability to ensure rapid detection, triage, resolution, and customer communication.
  • Act as the central authority during live site incidents, driving real-time decision-making and coordination across Engineering, Support, PM, Communications, and Field teams.
  • Participate in the on-call rotation.
  • Provide calm, decisive leadership in crisis situations, escalating as needed to senior leadership.
  • Promote a customer-first culture by prioritizing availability, reliability, and platform trust in every response.
  • Contribute to analyzing customer-impacting signals from telemetry, support cases, and feedback to identify root causes, drive incident reviews (RCAs/PIRs), and implement preventative service improvements.
  • Contribute to Azure platform improvements by incorporating learnings from live site events and customer feedback, ensuring improved reliability, observability, and supportability.
  • Collaborate closely with Engineering and Product teams to influence and implement service resiliency enhancements, auto-remediation tools, and customer-centric mitigation strategies.
  • Contribute to the development and adoption of incident response playbooks, mitigation levers, and operational frameworks aligned to real-world support scenarios and strategic customer needs.
  • Contribute to the design of next-generation architecture for cloud infrastructure services with a focus on reliability and strategic customer support outcomes.
  • Build and maintain cross-functional partnerships, ensuring alignment across engineering, business, and support organizations.
  • Apply an engineering mindset to operational challenges, balancing agility, scalability, and technical quality in collaboration with peers.

Before You Apply
Be aware of the location restriction for this remote position: India
Beware of scams! When applying for jobs, you should NEVER have to pay anything.