Lead Site Reliability Engineer @Akka

Salary: unspecified
Remote Location: Worldwide
Employment Type: full-time
Posted: 3d ago

Akka is hiring a remote Lead Site Reliability Engineer.

Role Description

Akka Platform is a cloud-native PaaS that enables teams to build and operate AI-enhanced microservices at scale. The Lead SRE owns reliability, scalability, and security for a multi-tenant platform running customer workloads across EKS, GKE, and AKS.

Own Service Level Objectives/Service Level Indicators (SLOs/SLIs) and error budgets across multi-cloud clusters (EKS, GKE, AKS); drive blameless post-mortems and systemic remediation.
Lead capacity planning with our customers, cluster lifecycle management, and Kubernetes and database upgrade cycles.
Define and enforce runbooks, on-call rotations, and escalation paths for the wider engineering organisation.

Own and evolve the IaC layer: Helm charts, Crossplane compositions, and FluxCD GitOps pipelines.
Design and maintain cloud-resource provisioning workflows that span all three cloud providers, with consistent policy controls.

Architect and operate connectivity patterns: AWS PrivateLink / Transit Gateway, GCP NCC, Azure VNet Peering, and cross-region ingress with Contour/Envoy.
Maintain and evolve the Linkerd service mesh for mTLS, workload identity (OIDC), and zero-trust authorisation policies.
Drive PKI hygiene with cert-manager: root/intermediate CA rotation, ACME certificate lifecycle, and secret management via KMS-backed Kubernetes vaults.

Own the observability stack: Prometheus, Cortex (multi-tenant metrics), OpenTelemetry sidecars, centralised log pipelines, and Groundcover / Grafana dashboards.
Establish alerting standards and SLO-based alerting rules; ensure distributed traces are actionable across JVM, Rust, and Go workloads.
Actively participate in on-call and lead the technical response for platform-level incidents.

Set engineering standards and review infrastructure changes across the team.
Partner with Security, Product, and Application Engineering to translate reliability requirements into platform capabilities.
Grow a team of 3–5 SREs through code review, architecture sessions, and career conversations.

Qualifications

7+ years in SRE, platform engineering, or infrastructure engineering roles.
Deep, hands-on Kubernetes experience: operating and scaling clusters across at least two of GKE, EKS, and AKS in production.
Proven IaC ownership: Helm chart authoring, Crossplane provider/composition design, and GitOps with Flux or ArgoCD.
Strong multi-cloud networking: VPC design, private connectivity (PrivateLink, NCC, VNet Peering), and DNS (Route 53, Cloud DNS, Azure DNS, Cloudflare).
Production experience with a service mesh (Linkerd or Istio) and Envoy-based ingress.
Solid observability track record with Prometheus, distributed tracing (OpenTelemetry), and structured logging pipelines.
Experience securing Kubernetes clusters: RBAC, workload identity / OIDC, mTLS, and secret management with cloud KMS.
Comfortable reading and writing at least one systems language (Go, Rust, or similar) and shell scripting for automation and operator development.

Requirements

Experience writing Kubernetes Operators / custom controllers (Go preferred).
Familiarity with JVM workloads on Kubernetes – GC tuning, heap sizing, graceful shutdown.
Exposure to event-driven / event-sourcing architectures (Akka, Kafka, or similar).
Experience with Teleport for federated cluster access.
Background operating Cortex for long-term, multi-tenant metrics storage.
Knowledge of gRPC service design and debugging.

Benefits

Competitive salary and equity, benchmarked against senior/lead IC roles in your market.
Remote-first culture with flexible working hours.
Comprehensive health and wellness benefits.
Opportunities for professional development and continuous learning.
Collaborative, inclusive, and innovative company culture.
A team that has strong opinions, writes good documentation, and builds things they are proud of.

Before You Apply

Be aware of the location restriction for this remote position: Worldwide
Beware of scams! When applying for jobs, you should NEVER have to pay anything.
