Senior DevOps / Platform Reliability Engineer @Zingtree

Devops

Salary unspecified	Remote Location Worldwide
Employment Type full-time	Posted 3d ago

Role Description

We’re hiring a Senior DevOps / Platform Reliability Engineer to own the platform that powers our agentic CX product. You’ll build the CI/CD, infrastructure, and observability backbone that enables us to ship multi-agent systems safely to enterprise customers.

If you want to operate a production AI platform and use AI to help operate it, this role is for you.

In this role, you will collaborate with development, operations, and infrastructure teams to:

Automate and streamline processes.
Build and maintain tools for deployment, monitoring, and operations.
Troubleshoot issues across development and production environments.

What You'll Do

Own and evolve CI/CD pipelines using GitHub Actions and OIDC-based authentication for microservices and agentic workloads, with safe, fast, and reversible deployments.
Automate infrastructure provisioning using Infrastructure as Code (IaC) tools such as Terraform and CloudFormation.
Operate and scale our Kubernetes platform (EKS + Argo CD), including:
- Autoscaling
- Ingress
- ExternalDNS
- cert-manager
- External Secrets Operator
- Backups
- Runtime guardrails
- Multi-tenant isolation for enterprise customers
Manage the edge and network perimeter, including:
- Cloudflare (CDN, WAF, Bot Management, DDoS protection, Zero Trust / Access)
- CloudFront
- API Gateway
- ALB/NLB
- Route 53
- Network security controls
Operate the data and event tier, including:
- Aurora MySQL
- ElastiCache/Redis
- S3
- MSK (Kafka)
This includes responsibility for backups, point-in-time recovery (PITR), and multi-AZ disaster recovery aligned to defined RTO/RPO objectives.
Build and maintain Lambda workloads where event-driven or serverless architectures are the right fit.
Build observability as a product using Prometheus, Grafana, and OpenTelemetry, including telemetry for LLM and agentic systems such as:
- Token cost
- Tool-call latency
- Evaluation signals
- Prompt/version tracking
Strengthen our security and compliance posture for SOC 2 and HIPAA, including:
- Least-privilege IAM
- SCPs
- Secrets management
- SAST/DAST
- Dependency and container scanning
- Image signing
- AWS Config
- Security Hub
- GuardDuty
- Inspector
- Evidence automation
Drive FinOps initiatives, including:
- Tagging standards
- Savings Plans and Reserved Instances
- Per-tenant and per-workload cost attribution
- LLM cost controls
Build and evolve our AI-native DevOps capabilities.
Partner with engineering teams to define platform standards, service templates, deployment best practices, and operational SLOs.
Monitor system performance and ensure reliability, scalability, and security across infrastructure and services.
Collaborate with software engineering teams to support continuous integration and continuous delivery best practices.
Document infrastructure, deployment processes, and operational standards to support knowledge sharing across the team.

Agentic AI in DevOps

You’ll help define how Zingtree uses agentic AI to operate and improve our platform using modern AI operational practices.

Responsibilities include:

Design and operate auto-remediation agents for common production toil such as:
- Certificate rotation
- Noisy pods
- Infrastructure drift
- Flaky CI pipelines
All destructive or customer-impacting actions require human-in-the-loop (HITL) controls.
Use LLMs for incident triage and root cause analysis, including:
- Log and trace summarization
- Signal correlation
- First-draft postmortems that are always reviewed by humans
Connect AI agents to internal systems through the Model Context Protocol (MCP), including:
- GitHub
- Jira
- PagerDuty
- AWS
- Kubernetes
- Terraform
These connections use scoped credentials, audit logging, and allow-listed access.
Apply AI-driven observability techniques, including:
- Anomaly detection on metrics
- LLM-based log clustering
- Alert deduplication and summarization on top of Prometheus and OpenTelemetry
Establish operational guardrails such as:
- Prompt/version pinning
- Evaluation frameworks for agent behavior
- Cost and rate-limit controls
- Policy-as-code (OPA/Conftest) for AI-generated infrastructure changes
- Clearly defined blast-radius controls
Define best practices for AI coding assistants such as GitHub Copilot, Claude, and Amazon Q in infrastructure repositories, including:
- Review workflows
- Prompt design
- Restrictions on auto-merged changes
Treat AI components as production systems with:
- SLOs
- Observability
- On-call readiness
- Runbooks
- Rollback strategies for agents and prompts

Qualifications

5+ years of experience in DevOps, SRE, or Platform Engineering operating production systems on AWS.
Strong experience with CI/CD pipelines and tools such as GitHub Actions, GitLab CI, Jenkins, or CircleCI.
Hands-on experience operating production EKS environments, including autoscaling, ingress, secrets management, and cluster upgrades.
Strong AWS networking experience, including multi-account VPC design, subnets, routing, security groups, NACLs, Route 53, ACM, and load balancers.
Deep experience with Terraform and GitHub Actions, ideally using OIDC-based cloud authentication.
Experience with Aurora/RDS MySQL, Redis (ElastiCache), and S3, including backups, PITR, migrations, and lifecycle management.
Strong observability experience using Prometheus, Grafana, and OpenTelemetry.
Experience operating Argo CD at scale.
Experience with Infrastructure as Code tools such as Terraform, CloudFormation, or Ansible.
Experience managing Cloudflare services including WAF, Bot Management, Rate Limiting, and Zero Trust / Access, along with CloudFront.
Experience operating Kafka/MSK at scale, including topics, consumer groups, and schema registries.
Experience with Lambda and event-driven architectures.
Comfortable working with Python, Bash, and Linux systems.
Strong understanding of security best practices across IAM, KMS, secrets management, networking, and software supply chain security.
Familiarity with vulnerability scanning and compliance tooling.

Nice to Have

Experience operating LLM or ML workloads in production, including LiteLLM, Bedrock, pgvector, prompt caching, or evaluation systems.
Experience building or integrating MCP servers or deploying agent frameworks such as LangGraph or CrewAI in production environments.

Benefits

Competitive compensation packages
Comprehensive health benefits:
- 100% of employee premiums covered
- 75%–80% of dependent premiums covered for most health, dental, and vision plans
401(k) plans to support retirement planning (no employer matching currently)
Paid parental leave
Unlimited PTO
Flexible remote work from anywhere
Up to $200/month co-working reimbursement
Home office stipend:
- Up to $500 for home office setup
- $100/month for internet, phone, and related expenses

Company Description

Zingtree is the next-generation intelligent process automation platform reimagining customer experience operations for the world’s top support leaders. With 500+ customers, including Optum, Corpay, Sony, SharkNinja, and Allianz, we transform self-service, surface automation opportunities, and turn every agent into an expert.
