Senior Site Reliability Engineer @TechInsights
DevOps / Sysadmin
Salary $125,200 - $132..
Remote Location
Job Type full-time
Posted 3d ago

[Hiring] Senior Site Reliability Engineer @TechInsights

3d ago - TechInsights is hiring a remote Senior Site Reliability Engineer. 💸 Salary: $125,200 - $132,500 CAD 📍 Location: Canada

Role Description

TechInsights is building the reliability and AI operations foundation for its next chapter: an AI-first intelligence platform that runs the most demanding semiconductor intelligence workflows in the world. We're looking for a Senior Site Reliability Engineer who wants to own that foundation.

This is a senior individual contributor role at the technical leadership tier of our Site Reliability Engineering team. You'll own strategic reliability initiatives end-to-end:

  • Setting technical direction
  • Defining SLOs and error budgets across our production platform
  • Designing reliability patterns for the AI agent pipelines
  • Enabling our development and AI Engineering teams to build and ship with confidence

What sets this role apart is its scope. You're not just keeping the lights on; you're building the observability, Internal Developer Platform (IDP), and service catalog that a fast-scaling AI platform needs from day one.

Qualifications

  • Bachelor's degree in Computer Science, Engineering, or equivalent combination of education and experience
  • 6–8 years of progressive experience in site reliability engineering, platform engineering, or DevOps, with demonstrated technical leadership at the senior individual contributor level
  • Deep expertise in AWS (EKS, Lambda, CloudWatch, AWS Config) and multi-region architecture patterns
  • Proficiency with Terraform and GitOps; experience with policy-as-code (Sentinel, OPA/Rego, or equivalent)
  • Hands-on Datadog experience at operational depth: dashboards, SLO tracking, alerting, log management, distributed tracing
  • Strong containerization expertise: Docker, Kubernetes (EKS preferred)
  • Proficiency in Python and/or Bash; experience building operational tooling; solid understanding of Java and Spring Boot microservice architecture sufficient to make reliability and deployment decisions for EKS-hosted services
  • Deep expertise in CI/CD pipeline design and optimization using Bitbucket Pipelines and GitHub Actions
  • Familiarity with IDP tooling (Backstage, Atlassian Compass, or equivalent) strongly preferred
  • Experience with AI/ML workload infrastructure, LLM API integration, or agentic system operations considered a strong asset

Requirements

  • Own SLOs, SLIs, and error budgets for all production services; drive error budget discipline across engineering
  • Design reliability patterns for AI agent pipelines: LLM observability, tool-use tracking, failure detection, and graceful degradation
  • Architect for blast radius containment: agent failures must have bounded customer impact through isolation, circuit breaking, and rapid recovery
  • Mature our Canada Central/West active-active architecture toward 24-hour RTO with full regional failover
  • Lead incident response and post-incident reviews that produce durable fixes; maintain DR procedures through regular testing
  • Serve as the primary reliability liaison to Software and AI Engineering, translating requirements into actionable standards
  • Partner with AI Engineering on compute provisioning, model serving, inference latency, and workload isolation
  • Own CI/CD pipeline strategy (Bitbucket Pipelines, GitHub Actions): set standards, optimize deployment frequency, and ensure teams can ship confidently
  • Drive IDP adoption and enable teams on SRE practices: on-call readiness, SLO definition, runbook development, and self-service tooling
  • Represent reliability in architectural discussions; surface risk before it's committed to design
  • Own the service catalog: a living inventory of all services, AI agents, dependencies, ownership, and SLOs
  • Operate Datadog as the single pane of glass for service health, infrastructure, and agentic pipeline telemetry
  • Extend observability to AI workloads: LLM latency, token consumption, agent completion rates, and pipeline throughput
  • Build golden path templates in Backstage and/or Atlassian Compass so teams ship reliably without routine SRE involvement
  • Apply AIOps in Datadog to automate anomaly detection, incident triage, and remediation recommendations
  • Own infrastructure as code via Terraform and GitOps; enforce IaC policy in partnership with Trust Assurance
  • Own FinOps visibility into AWS cost segments; model cloud cost impact as AI/ML workloads scale
  • Formally mentor junior and intermediate SRE engineers, with accountability for their technical growth and career progression
  • Build AI-assisted automation to progressively reduce toil and scale the team's operational capacity

Benefits

  • Company-sponsored training and development opportunities
  • Comprehensive benefits package (health, dental, vision, wellness, RRSP Matching, annual fitness reimbursement)
  • Flexible vacation policy
  • Community involvement opportunities through charitable alliances
  • Wellness resources and support
  • Inclusive environment that prioritizes diversity, equity, and accessibility
  • High-growth company driven by high performance
  • Expected salary range: $125,200 - $132,500 CAD

Working Arrangement

This is a remote position for candidates based in Canada. Occasional travel may be required.

Technology knows no bounds, and neither does TechInsights. Bringing together talented humans from different perspectives, backgrounds and abilities is something we take seriously. We're committed to building an inclusive environment that welcomes you to be your authentic self and allows us to push past the boundaries together.

TechInsights is committed to meeting the needs of people with disabilities. Accommodations are available on request for candidates taking part in all aspects of the selection process.

AI technology may be used to assist in the screening and assessment of applications for this position. Our recruiters are involved at every stage, and all hiring decisions are made by People and hiring teams.

As part of any recruitment process, TechInsights collects and processes personal data relating to job applicants. We are committed to being transparent about how we collect and use that data and to meeting our data protection obligations.
