Role Description
As a Senior Site Reliability Engineer, you will be a hands-on technical individual contributor embedded within the Core Systems team, responsible for the daily health, stability, and performance of our production environment. You will serve as a primary responder for production incidents, owning triage through resolution, including root cause analysis, infrastructure remediation, and order automation recovery. You will work directly alongside the Manager, Consumer Technology Site Reliability, and Helpdesk to handle day-to-day triage and fix responsibilities, enabling leadership to focus on strategic decisions and team direction. You will also partner with development teams to evaluate production risk before deployment.
Your essential job responsibilities will include the following:
-
Production Triage:
Triage all incidents surfaced via the #triage Slack channels, Datadog alerts, Rundeck failures, contact center reports, and proactive monitoring across all business units.
-
Incident Ownership:
Serve as the primary on-call responder for production incidents. Acknowledge, investigate, and drive issues to resolution with clear communication throughout the incident lifecycle.
-
Root Cause Analysis:
Lead RCA (Root Cause Analysis) for production failures, including order automation breakdowns, Gearman/worker queue degradation, API integration outages, batch job timeouts, and database performance events. Document findings with sufficient detail to support post-mortem review.
-
Hands-On Remediation:
Execute infrastructure-level remediation, including EC2 instance restarts, Gearman worker pool resets, Rundeck job recovery, order status resets, and inventory and pricing queue restoration.
-
Regression Identification:
Identify deployment-related regressions by correlating incident timelines to recent deployments. Initiate and coordinate revert requests with development teams when causal links are established.
-
Incident Coordination:
Direct cross-functional teams during active incidents: assigning investigation tasks, managing parallel workstreams, tracking affected order or customer counts, and keeping all stakeholders informed via Slack threads and JIRA ticket updates.
-
Focus Areas:
Monitor the entire Consumer Enterprise Group (CEG) Platform processing environment and proactively surface anomalies, enhancement opportunities, and risk areas to leadership.
-
Assist with data cleanup and order recovery operations following production incidents.
-
Support testing and validation of infrastructure changes prior to production deployment.
-
Ensure accurate and timely entry of incident details, findings, and resolutions into JIRA tracking systems.
-
Continue to develop expertise in the CEG codebase, third-party integrations, and operational tooling through working sessions and self-directed learning.
-
Pursue training, professional development, and certification opportunities that will enhance effectiveness in the role.
-
Other duties as assigned.
Qualifications
-
5+ years in a Site Reliability Engineering, DevOps, or Production Support role at a software or e-commerce company.
-
Demonstrated ability to independently diagnose and resolve production incidents, including infrastructure-level failures (servers, queues, batch jobs, APIs).
-
Hands-on experience with AWS (EC2, CloudWatch, or equivalent) for day-to-day operational tasks.
-
Experience with Datadog, New Relic, PagerDuty, or equivalent platforms for monitoring, alerting, and incident detection.
-
Working knowledge of MySQL/relational databases for investigative queries and data validation. Ability to read and analyze complex SQL queries to diagnose production data issues.
-
Familiarity with PHP, Python, Bash, or similar languages sufficient to read, debug, and modify production scripts and automation jobs.
-
Experience with Rundeck, cron, or equivalent batch job management and monitoring tools.
Requirements
-
Problem-Solving
-
Composure
-
Accountability
-
Detail-Oriented
-
Adaptability
-
Collaborative
-
Proactive
-
Communication
-
Results Orientation
Physical Job Requirements
-
Continuous viewing from and inputting data to a computer screen
-
Frequent talking via computer during meetings and one-on-one conversations
-
Sitting for long periods of time
-
Travel required (<10%)
Benefits
-
Competitive and comprehensive benefits package including paid time off, medical, dental, vision, and 401(k) match (50% on the dollar up to 7% of employee contribution).
-
Compensation offered for this position will depend on qualifications, experience, and geographic location.
-
Total compensation package may also include commission, bonus, or profit sharing.