[Hiring] Lead Site Reliability Engineer @Akka
Salary unspecified
Remote Location
Employment Type full-time
Posted 3d ago

3d ago - Akka is hiring a remote Lead Site Reliability Engineer. 💸 Salary: unspecified 📍 Location: Worldwide

Role Description

Akka Platform is a cloud-native PaaS that enables teams to build and operate AI-enhanced microservices at scale. The Lead SRE owns reliability, scalability, and security for a multi-tenant platform running customer workloads across EKS, GKE, and AKS.

  • Own Service Level Objectives/Service Level Indicators (SLOs/SLIs) and error budgets across multi-cloud clusters (EKS, GKE, AKS); drive blameless post-mortems and systemic remediation.
  • Lead capacity planning with our customers, cluster lifecycle management, and Kubernetes and database upgrade cycles.
  • Define and enforce runbooks, on-call rotations, and escalation paths for the wider engineering organisation.
  • Own and evolve the IaC layer: Helm charts, Crossplane compositions, and FluxCD GitOps pipelines.
  • Design and maintain cloud-resource provisioning workflows that span all three cloud providers, with consistent policy controls.
  • Architect and operate connectivity patterns: AWS PrivateLink / Transit Gateway, GCP NCC, Azure VNet Peering, and cross-region ingress with Contour/Envoy.
  • Maintain and evolve the Linkerd service mesh for mTLS, workload identity (OIDC), and zero-trust authorisation policies.
  • Drive PKI hygiene with cert-manager: root/intermediate CA rotation, ACME certificate lifecycle, and secret management via KMS-backed Kubernetes vaults.
  • Own the observability stack: Prometheus, Cortex (multi-tenant metrics), OpenTelemetry sidecars, centralised log pipelines, and Groundcover / Grafana dashboards.
  • Establish alerting standards and SLO-based alerting rules; ensure distributed traces are actionable across JVM, Rust, and Go workloads.
  • Actively participate in on-call and lead the technical response for platform-level incidents.
  • Set engineering standards and review infrastructure changes across the team.
  • Partner with Security, Product, and Application Engineering to translate reliability requirements into platform capabilities.
  • Grow a team of 3–5 SREs through code review, architecture sessions, and career conversations.

Qualifications

  • 7+ years in SRE, platform engineering, or infrastructure engineering roles.
  • Deep, hands-on Kubernetes experience: operating and scaling clusters across at least two of GKE, EKS, and AKS in production.
  • Proven IaC ownership: Helm chart authoring, Crossplane provider/composition design, and GitOps with Flux or ArgoCD.
  • Strong multi-cloud networking: VPC design, private connectivity (PrivateLink, NCC, VNet Peering), and DNS (Route 53, Cloud DNS, Azure DNS, Cloudflare).
  • Production experience with a service mesh (Linkerd or Istio) and Envoy-based ingress.
  • Solid observability track record with Prometheus, distributed tracing (OpenTelemetry), and structured logging pipelines.
  • Experience securing Kubernetes clusters: RBAC, workload identity / OIDC, mTLS, and secret management with cloud KMS.
  • Comfortable reading and writing at least one systems language (Go, Rust, or similar) and shell scripting for automation and operator development.

Requirements

  • Experience writing Kubernetes Operators / custom controllers (Go preferred).
  • Familiarity with JVM workloads on Kubernetes – GC tuning, heap sizing, graceful shutdown.
  • Exposure to event-driven / event-sourcing architectures (Akka, Kafka, or similar).
  • Experience with Teleport for federated cluster access.
  • Background operating Cortex for long-term, multi-tenant metrics storage.
  • Knowledge of gRPC service design and debugging.

Benefits

  • Competitive salary and equity, benchmarked against senior/lead IC roles in your market.
  • Remote-first culture with flexible working hours.
  • Comprehensive health and wellness benefits.
  • Opportunities for professional development and continuous learning.
  • Collaborative, inclusive, and innovative company culture.
  • A team that has strong opinions, writes good documentation, and builds things they are proud of.