[Hiring] Lead Site Reliability Engineer @Juniper Square
Software Development
Salary unspecified
Remote Location
Employment Type full-time
Posted 2d ago

2d ago - Juniper Square is hiring a remote Lead Site Reliability Engineer. 💸 Salary: unspecified 📍 Location: Worldwide

Role Description

Own and drive the technical direction for your team's infrastructure systems, making architectural decisions that balance reliability, scalability, and cost.

  • Design systems of moderate to high complexity using distributed systems best practices; anticipate future use cases and minimize technical debt.
  • Conduct architectural reviews and advance design patterns across the organization.
  • Identify and implement improvements to existing software architecture; define and expand design patterns to solve common platform problems.
  • Define and enforce security best practices across team-owned systems; proactively surface gaps to senior leadership.

Reliability & Operational Excellence

  • Own the reliability posture of team-owned services: establish SLOs, monitor SLAs, and hold the team accountable to them.
  • Lead incident response for complex, multi-service issues; systematically debug, identify root causes, and ensure issues do not recur.
  • Establish standards for logging, monitoring, and operationalization across all team-owned systems.
  • Foresee potential operational issues and implement preventative measures to safeguard the customer experience.
  • Participate in and help lead the on-call rotation; ensure production systems are appropriately instrumented.

Project & Delivery Ownership

  • Act as DRI (Directly Responsible Individual) for medium-to-large SRE projects spanning months and involving cross-team collaboration.
  • Partner with Engineering Managers and Product Managers to scope roadmap initiatives, break down work into actionable increments, and commit to delivery plans.
  • Negotiate scope effectively when required, ensuring adjustments align with customer needs and project goals.
  • Proactively identify and resolve project risks (dependencies, architectural drift, and staffing blockers) before they impact delivery.

Qualifications

  • 7-10 years of experience in Site Reliability Engineering, DevOps, or Platform Engineering in a production cloud environment.
  • 5+ years of hands-on experience with AWS cloud services across compute, networking, storage, and security.
  • 5+ years managing Linux-oriented production environments at scale.
  • 5+ years using Infrastructure-as-Code (Terraform, CDK, CloudFormation) and/or GitOps best practices.
  • 3+ years operating and troubleshooting production Kubernetes environments.
  • 3+ years applying AWS Well-Architected Framework principles across reliability, security, performance, and cost pillars.
  • 3+ years in cloud security best practices including IAM, secrets management, network security, and compliance.
  • 3+ years working with PostgreSQL in production: performance tuning, replication, backup, and recovery.
  • Demonstrated track record of leading multi-person technical projects from scoping through delivery.

Technical Skills

  • Strong general programming skills; comfort writing automation scripts and tooling in Python, Go, or similar.
  • Deep knowledge of observability tooling (metrics, logging, distributed tracing) and how to use them to drive reliability.
  • Solid understanding of data retention, backup, and recovery processes across cloud-native systems.
  • Experience with CI/CD pipelines, release management, and deployment automation.
  • Familiarity with service mesh, API gateway patterns, and microservices architectures.

AI Fluency

  • Experience using AI-assisted workflows across the SDLC, with an emphasis on production reliability, operability, and maintainability of large-scale systems (design, deployment, monitoring, incident response).
  • Hands-on experience integrating LLMs or AI systems into production environments, with a focus on reliability, latency, observability, and failure handling (e.g., automated triage, incident copilots, runbook automation).
  • Familiarity with agent-based or workflow automation systems applied to operational use cases such as alert triage, remediation loops, system diagnostics, or automated runbook execution.
  • Demonstrated ability to apply AI tools to improve system reliability, reduce MTTR, automate operational workflows, and enhance observability and alerting systems.
  • Working knowledge of LLMs, embeddings, RAG, and their operational constraints in production systems (latency, cost, drift, safety, and observability).
  • Ability to identify opportunities where AI can meaningfully improve system reliability, on-call efficiency, incident response, and infrastructure automation.

Leadership & Collaboration

  • Proven ability to lead technical discussions, drive alignment across engineering and product, and communicate decisions clearly to stakeholders.
  • Experience mentoring junior and mid-level engineers in both technical skills and professional development.
  • Able to operate independently with minimal supervision; comfortable making final technical decisions as DRI.
  • Strong communication skills in English, written and verbal, with experience influencing cross-functional partners.

Benefits

  • High-impact role at the intersection of cloud infrastructure and financial technology; your work directly underpins products managing hundreds of billions in AUM.
  • Significant growth potential: opportunity to help shape the SRE practice and prepare the platform for exponential scale.
  • A promising technology roadmap spanning capacity planning, Kubernetes migrations, and service-oriented architecture modernization.
  • Collaborative, engineering-driven culture that values quality, curiosity, and ownership.
  • Competitive compensation and benefits package.