Role Description
We are building a next-generation AI-driven platform that enables developers, creators, and enterprises to design scalable, secure, and human-centered AI experiences. As adoption grows, we are looking for a Backend Ops Engineer to take full ownership of infrastructure, ensuring high reliability, performance, and cost efficiency.
This role is critical to centralizing DevOps and infrastructure responsibilities: enabling faster deployments, reducing operational risk, optimizing cloud costs, and building a strong foundation for scalability and compliance. You will also play a key role in integrating AI into infrastructure operations, shaping the future of intelligent DevOps systems.
Key Responsibilities
- Initial Focus (First Quarter)
  - Implement AI-powered operations such as log analysis, automated infrastructure updates, and predictive scaling alerts
  - Benchmark cloud and edge services and prototype scalability improvements
  - Build self-healing infrastructure pipelines that demonstrate advanced AI capabilities
- Ongoing Responsibilities
  - Design, automate, and manage cloud infrastructure using Terraform and AWS services (ECS/Fargate, RDS, S3, IAM)
  - Build and maintain CI/CD pipelines using GitHub Actions for efficient and reliable deployments
  - Implement observability and monitoring systems using tools like Prometheus, Grafana, OpenTelemetry, and Sentry
  - Manage containerization using Docker and troubleshoot performance issues under load
  - Collaborate with backend teams to ensure low-latency, cost-effective, and scalable systems
  - Continuously improve system reliability, uptime, and incident response processes
Growth Path
- Progress into a Staff Platform Engineer or Lead SRE role
- Own end-to-end platform infrastructure
- Contribute to building enterprise-ready deployment frameworks
- Help define best practices for AI-driven DevOps systems
Qualifications
- 2–3+ years of experience in DevOps or Site Reliability Engineering roles
- Strong expertise in AWS (ECS/Fargate, RDS, S3, CloudWatch, IAM)
- Hands-on experience with Terraform and infrastructure-as-code practices
- Proven experience with CI/CD pipelines (GitHub Actions preferred)
- Strong knowledge of Docker and containerized environments
- Experience with observability tools such as Prometheus, Grafana, OpenTelemetry, or Sentry
- Demonstrated ability to debug and resolve infrastructure issues under load
AI & Technical Mindset
- Interest or experience in integrating AI into DevOps workflows
- Exposure to LLM APIs or AI-driven automation tools is a plus
Nice to Have
- Familiarity with compliance frameworks such as SOC 2 or GDPR
- Experience with multi-cloud environments (AWS, GCP, Azure)
- Proficiency in Python for scripting and automation
- Knowledge of infrastructure security best practices
- Experience with other cloud platforms such as DigitalOcean, or with cloud migration projects
Soft Skills
- Strong ownership mindset with a proactive approach
- Clear communication skills to explain technical trade-offs
- Bias toward automation and continuous improvement
- Ability to work effectively in fast-paced, evolving environments
Key Skills
- AWS ECS / Fargate
- Terraform
- CI/CD Pipelines
- Docker
- Observability & Monitoring