Senior Reliability Engineer - Technology @Truelogic
Software Development
Salary: highly competitive
Location: Remote
Job type: Full-time
Posted: 1 month ago

1 month ago - Truelogic is hiring a remote Senior Reliability Engineer - Technology. 💸 Salary: highly competitive USD pay. 📍 Location: Latin America (LATAM).

Role Description

The Site Reliability Engineer plays a key role in operating, observing, and improving the reliability of existing distributed systems running on AWS and Kubernetes, with a strong emphasis on observability, operational maturity, and automated responses to system behavior.

  • Designs, implements, and continuously improves observability strategies across services, including metrics, logs, traces, alerts, and dashboards.
  • Focuses on understanding system behavior in production, identifying failure modes, performance bottlenecks, and reliability risks.
  • Evolves and maintains shared AWS CDK and CDK8s constructs, with emphasis on observability, autoscaling, and operational safeguards.
  • Maintains and operates core platform components such as VPC, EKS clusters, RDS, OpenSearch, and MSK.
  • Operates and enhances Kubernetes cluster addons such as ingress controllers, cert-manager, autoscalers, and monitoring, logging, and tracing stacks.
  • Defines and maintains SLIs, SLOs, and alerting strategies that clearly distinguish between symptoms, root causes, and actionable operational events.
  • Improves automated operational responses, including autoscaling, self-healing mechanisms, and runbook-driven remediation.
  • Ensures high reliability through structured alerting systems (Prometheus, CloudWatch), noise reduction, alert quality improvements, and recovery mechanisms.
  • Collaborates with engineering teams to investigate production incidents, perform root cause analysis, and drive long-term reliability improvements.
  • Owns CI/CD pipelines for Infrastructure as Code (IaC) and observability-related platform components.
  • Applies Site Reliability Engineering (SRE) principles—including observability-first design, error budgets, and operational readiness—to shared platform services.
  • Supports IAM roles, secrets management, and tenant isolation best practices.

Qualifications

  • 5+ years of experience in Site Reliability Engineering, Platform Engineering, or Infrastructure roles, with significant hands-on experience operating and supporting production systems.
  • Strong experience in observability operations, including defining metrics, logs, traces, dashboards, alerts, and reliability indicators for complex systems.
  • Hands-on experience with AWS services such as VPC, IAM, RDS, MSK, S3, and CloudWatch, as well as Kubernetes components like Helm, RBAC, and ServiceAccounts.
  • Fluency in Python and experience with Infrastructure-as-Code using AWS CDK, CDK8s, or equivalent frameworks.
  • Strong understanding of Prometheus, Grafana, alert tuning, alert fatigue reduction, and incident-driven monitoring improvements.
  • Experience improving existing systems rather than building greenfield infrastructure, with a focus on operational excellence and system reliability.
  • Proven track record of using observability data to drive automation, scaling decisions, and operational improvements.
  • Experience designing reusable infrastructure or observability patterns, or contributing to internal developer or platform tooling.
  • Experience supporting Spark on Kubernetes, Argo, or Kafka-based batch pipelines (nice to have).

Benefits

  • 100% Remote Work: Enjoy the freedom to work from the location that helps you thrive.
  • Highly Competitive USD Pay: Earn excellent, market-leading compensation in USD.
  • Paid Time Off: Our paid time off policies ensure you have the chance to unwind and recharge.
  • Work with Autonomy: Enjoy the freedom to manage your time as long as the work gets done.
  • Work with Top American Companies: Grow your expertise working on innovative, high-impact projects.

Company Description

At Truelogic, we are a leading provider of nearshore staff augmentation services headquartered in New York. For over two decades, we’ve been delivering top-tier technology solutions to companies of all sizes, from innovative startups to industry leaders, helping them achieve their digital transformation goals.

  • Our team of 600+ highly skilled tech professionals, based in Latin America, drives digital disruption by partnering with U.S. companies on their most impactful projects.
  • A data-driven technology company that partners with high-growth brands to optimize customer acquisition and retention.
  • Collaborates with major platforms and agencies such as Shopify, Experian, TransUnion, and top media partners.