Staff Software Engineer - Grafana Databases, Managed Services @Grafana Labs

3d ago - Grafana Labs is hiring a remote Staff Software Engineer - Grafana Databases, Managed Services. 💸 Salary: GBP 103,958 - 124,750 per year 📍 Location: European timezones, GMT (UTC+0), UTC-2, CAT (UTC-1), CET +/- 3 HOURS, GMT to GMT+4

Role Description

The Managed Services team is a newly formed squad within the Databases department. It owns and operates shared, production-critical infrastructure that powers Grafana Cloud’s next-generation database products (Mimir, Loki, and Tempo). Today, this includes operating 100+ WarpStream clusters across multiple cloud providers and regions, with continued growth expected. WarpStream acts as the streaming backbone for ingestion and read/write decoupling across databases. It sits directly on the hot path for metrics, logs, and traces, handling high-throughput, multi-consumer workloads at massive scale.

In addition to streaming infrastructure, the team works closely with high-volume analytical and storage systems that power query-heavy and aggregation-heavy workloads, where latency, compression behavior, storage layout, and scaling characteristics matter deeply.

What You’ll Be Doing

  • Operate and evolve 100+ multi-cloud streaming clusters and related database infrastructure.
  • Diagnose and eliminate cross-layer failure modes (e.g., object storage latency, noisy neighbors, control-plane bottlenecks, and query performance regressions).
  • Design safe upgrade and rollout strategies at scale.
  • Improve observability, automation, and operational ergonomics.
  • Partner closely with database and platform teams to ensure safe scaling, partitioning, consumer fan-out, and query performance.
  • Work directly with distributed systems behavior, Kubernetes scheduling dynamics, storage engines, compression trade-offs, etc.
  • Serve as a primary escalation point and on-call for relevant incidents.
  • Own the relationship with all system vendors, including WarpStream Labs and others.

At the Staff level

  • Help define and evolve the technical direction for operating WarpStream and adjacent shared database systems at scale.
  • Lead complex initiatives such as migrations, rollout improvements, and reliability investments.
  • Establish best practices around SLOs, scaling limits, failure isolation, and change safety.
  • Investigate and drive resolution of multi-layer incidents spanning storage, compute, networking, and control-plane dependencies.
  • Identify systemic risks across 100+ clusters and contribute architectural improvements that reduce recurring issues.
  • Reduce systems toil and improve operational ergonomics through automation.
  • Partner with database and platform teams to align on strategy and long-term scalability.
  • Mentor and support engineers as the team matures.

What Makes You a Great Fit

  • Regular 1:1s with your manager and close collaboration with teammates across regions.
  • Defining and evolving SLO strategy for shared database infrastructure.
  • Setting standards for diagnosability across core streaming and database systems in production.
  • Leading complex initiatives across high-throughput, multi-cloud infrastructure.
  • Designing and promoting fault-tolerant architectural patterns.
  • Defining rollout, migration, and upgrade safety practices.
  • Partnering with database and platform engineering leaders.
  • Leading design discussions and reviewing PRs.
  • Raising the bar for practices across teams by mentoring engineers.
  • Playing a key role in high-impact incident response.

Requirements

  • 8+ years of engineering experience, including meaningful time in SRE, platform engineering, production engineering, infrastructure engineering, or distributed systems roles.
  • Experience with high-throughput streaming systems, analytical or storage backends, or large-scale database infrastructure.
  • Strong Kubernetes experience in AWS, GCP, or Azure, and familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.).
  • Experience leading or driving complex technical efforts.
  • Ability to influence technical direction and align teams around reliability improvements.
  • Strong understanding of distributed systems failure modes in multi-cloud environments.
  • Proficiency in at least one systems-oriented language (Go preferred, but not required).
  • Working knowledge of Linux internals, networking, cloud storage, and performance/scaling behavior.
  • Experience participating in blameless incident response and writing high-quality post-incident reviews.
  • Clear communicator who can collaborate across teams and work autonomously.
  • Intellectually curious, transparent, action-oriented, and kind.

Compensation & Rewards

  • In the United Kingdom, the base compensation range for this role is GBP 103,958 - GBP 124,750.
  • Actual compensation may vary based on level, experience, and skillset as assessed in the interview process.
  • Benefits include equity, bonus (if applicable) and other benefits.
  • All roles include Restricted Stock Units (RSUs).

Why You’ll Thrive at Grafana Labs

  • 100% Remote, Global Culture.
  • Scaling Organization – Tackle meaningful work in a high-growth, ever-evolving environment.
  • Transparent Communication – Expect open decision-making and regular company-wide updates.
  • Innovation-Driven – Autonomy and support to ship great work and try new things.
  • Open Source Roots – Built on community-driven values that shape how we work.
  • Empowered Teams – High trust, low ego culture that values outcomes over optics.
  • Career Growth Pathways – Defined opportunities to grow and develop your career.
  • Approachable Leadership – Transparent execs who are involved, visible, and human.
  • Passionate People – Join a team of smart, supportive folks who care deeply about what they do.
  • In-Person Onboarding – We want you to thrive from day 1.
  • Balance Is Key – We operate a global annual leave policy of 30 days per annum.

Equal Opportunity Employer

We will recruit, train, compensate and promote regardless of race, religion, color, national origin, gender, disability, age, veteran status, and all the other fascinating characteristics that make us different and unique.

Before You Apply

Be aware of the location restriction for this remote position: European timezones, GMT (UTC+0), UTC-2, CAT (UTC-1), CET +/- 3 HOURS, GMT to GMT+4