Salary: unspecified
Remote Location: India
Job Type: full-time
Posted: 1mth ago
1mth ago - PwC is hiring a remote Gen AI Site Reliability Engineer. 💸 Salary: unspecified 📍Location: India
Industry/Sector: Not Applicable
Specialism: Managed Services
Management Level: Senior Associate
Job Description & Summary
At PwC, our people in managed services focus on a variety of outsourced solutions and support clients across numerous functions. These individuals help organisations streamline their operations, reduce costs, and improve efficiency by managing key processes and functions on their behalf. They are skilled in project management, technology, and process optimization to deliver high-quality services to clients.
Focused on relationships, you are building meaningful client connections, and learning how to manage and inspire others. Navigating increasingly complex situations, you are growing your personal brand, deepening technical expertise and awareness of your strengths. You are expected to anticipate the needs of your teams and clients, and to deliver quality. Embracing increased ambiguity, you are comfortable when the path forward isn’t clear, you ask questions, and you use these moments as opportunities to grow.
Examples of the skills, knowledge, and experiences you need to lead and deliver value at this level include but are not limited to:
AC - Staff - Experienced - GenAI Site Reliability Engineer
Role: GenAI Site Reliability Engineer
Level: AC - Staff - Experienced
Tower: AI Operations & Platform Support (AI Managed Services)
Experience:
Key Skills: Monitoring & Alerting; Incident Investigation; Troubleshooting; Automation/Scripting; Cloud Operations; GenAI Platform Operations
Educational Qualification: Bachelor’s degree in Computer Science/IT or relevant field (Master’s or relevant certifications preferred)
Work Location: Bangalore / Hyderabad, India (Remote)
Job Description
As an AC - Staff - Experienced GenAI Site Reliability Engineer, you will operate and improve monitoring for in-scope GenAI services and AI workloads, investigate incidents, and implement reliability improvements. You will build dashboards, tune alerts, document runbooks, and automate repetitive operational tasks to improve stability and reduce time to restore.
Key Responsibilities:
Monitoring, Alerting & Service Health:
Build and maintain dashboards and alerts for availability, latency, error rates, and overall service health for in-scope GenAI services.
Tune thresholds and alert routing to reduce noise and improve actionable detection, lowering mean time to acknowledge (MTTA) and mean time to restore (MTTR).
Incident Triage, Investigation & Restoration:
Triage incidents, gather evidence, and perform structured troubleshooting using logs/metrics/traces and documented runbooks.
Execute restoration steps and coordinate escalations to platform owners, engineering teams, or vendors for complex issues.
Provide clear technical updates during live events and document resolution details for future reference and trend analysis.
Problem Prevention & Reliability Improvements:
Contribute to root-cause investigations and implement corrective actions (monitoring improvements, configuration changes, resilience enhancements).
Identify recurring failure modes and propose fixes that reduce repeat incidents and improve overall service stability.
Support verification of corrective actions by monitoring outcomes and validating that improvements reduce incident recurrence.
Performance Troubleshooting Support:
Assist with latency and error investigations by gathering diagnostics, isolating contributing factors, and proposing mitigations.
Partner with engineering teams to validate fixes and monitor post-deployment impact on service health and performance.
Automation & Scripting:
Automate diagnostics and routine operational tasks to reduce manual effort and improve consistency (scripts, repeatable checks, standardized steps).
Maintain and document operational scripts and ensure they are usable and supportable by the broader team.
Documentation & Knowledge Management:
Maintain runbooks, troubleshooting guides, and knowledge articles for frequent scenarios and standard operating procedures.
Document known issues, standard resolutions, and escalation paths to improve first-time fix rate and onboarding efficiency.
Change Readiness & Post-Change Validation:
Support operational readiness for changes by validating monitoring readiness, runbook updates, and post-change verification steps.
Execute post-change checks and report regressions or unexpected behavior promptly to ensure rapid remediation.
Continuous Improvement & Service Reporting Inputs:
Identify operational pain points and recommend improvements to monitoring, alerting, runbooks, and support workflows.
Provide inputs to service reporting on incident trends, recurring issues, and improvement opportunities related to GenAI reliability.
Quality, Controls & Operational Discipline:
Follow defined operational processes (incident, request, change) and maintain high-quality ticket hygiene and documentation discipline.
Comply with security and access controls for supported tools and environments; proactively raise operational risks or control gaps for mitigation.
Collaboration & Team Support:
Collaborate with peers and leads to coordinate workload, share knowledge, and support consistent execution standards across the pod.
Support onboarding and knowledge transfer by maintaining clear documentation and participating in team enablement activities.
Required Skills:
Hands-on experience supporting production services in a cloud environment, including monitoring, troubleshooting, and incident response.
Experience building dashboards and alerts and working with logs/metrics/traces to diagnose issues and reduce time to restore.
Strong analytical problem-solving skills and ability to implement reliability improvements and corrective actions in a controlled manner.
Experience working within ITIL-aligned processes (incident, request, change) and maintaining runbooks/knowledge articles.
Preferred: experience with ITSM and observability tooling (e.g., client ITSM and monitoring tools; ServiceNow, CloudWatch, Datadog, Splunk, New Relic). Familiarity with GenAI services (AWS Bedrock, OpenAI/ChatGPT Enterprise) is desirable. AWS certifications are highly preferred.
Managed Services - AI Services
At PwC, we relentlessly focus on working with our clients to bring the power of technology and humans together and create simple yet powerful solutions. We imagine a day when our clients can simply focus on their business, knowing that they have a trusted partner for their IT needs. Every day, we are motivated and passionate about making our clients better.
Within our Managed Services platform, PwC delivers integrated services and solutions that are grounded in deep industry experience and powered by the talent that you would expect from the PwC brand. The PwC Managed Services platform delivers scalable solutions that add greater value to our clients’ enterprises through technology and human-enabled experiences. Our team of highly skilled and trained global professionals, combined with the latest advancements in technology and process, allows us to provide effective and efficient outcomes. With PwC’s Managed Services, our clients can focus on accelerating their priorities, including optimizing operations and accelerating outcomes. PwC brings a consultative-first approach to operations, leveraging our deep industry insights, world-class talent, and assets to enable transformational journeys that drive sustained client outcomes. Our clients need flexible access to world-class business and technology capabilities that keep pace with today’s dynamic business environment.
Within our global Managed Services platform, we provide AI Managed Services, where we focus on the evolution of our clients’ AI portfolio. Our focus is to empower our clients to navigate and capture the value of their application portfolio while cost-effectively operating and protecting their solutions. We do this so that our clients can focus on what matters most to their business: accelerating dynamic, efficient and cost-effective growth.
We are looking for candidates to join our AI Managed Services team who thrive in a fast-paced work environment and can work on a mix of critical Application Evolution Service offerings and engagements, including help desk support, enhancement and optimization work, and strategic roadmap and advisory-level work. It will also be critical to lend experience and effort to help win and support customer engagements from both a technical and a relationship perspective.
Travel Requirements: 0%
Job Posting End Date
Be aware of the location restriction for this remote position: India