Role Description
The Machine Learning Platform team at Reddit is a high-impact team that owns the infrastructure powering recommendations, content discovery, and user and content quantification, and it directly supports other teams such as Growth, Ads, Feeds, and Core Machine Learning.
As a Senior Staff Software Engineer, you will help define and lead the vision for Reddit's large-scale GenAI Platform, shaping the strategy, architecture, and operating model that enable teams across the company to build, deploy, and scale generative AI products with confidence.
-
Contribute to the design, implementation, and maintenance of the LLM Gateway, focusing on features such as unified API endpoints for internally and externally hosted LLMs, rate and token limit management, and intelligent failover mechanisms to improve uptime and reliability.
-
Lead and execute the vision, strategy, and roadmap for Reddit's large-scale GenAI Platform.
-
Define the platform architecture and operating model that enable teams to build, deploy, and scale GenAI products reliably.
-
Drive the strategy for a unified LLM Gateway supporting internally and externally hosted LLMs through consistent APIs and abstractions.
-
Set the direction for core platform capabilities such as rate and token limit management, intelligent failover, and production resilience.
-
Shape Reddit's approach to an enterprise-grade RAG system.
-
Establish the strategic direction for agentic AI workflows and tool-use patterns across the platform.
-
Own the end-to-end platform strategy from concept through production adoption and long-term evolution.
-
Drive MLOps and LLMOps standards across CI/CD, testing, versioning, evaluation, and lifecycle management.
-
Define best practices for observability, monitoring, governance, and operational excellence across GenAI systems.
-
Partner across engineering, product, and leadership to align platform investments with company priorities and user needs.
-
Champion platform thinking with a strong focus on scalability, reliability, performance, and developer experience.
-
Influence technical direction across teams by turning emerging AI capabilities into a scalable platform strategy.
Qualifications
-
10+ years of experience in ML Engineering, AI Platform Engineering, or Cloud AI Deployment roles.
-
Have a track record of leading technical strategy and delivering AI platforms in cloud-based production environments at scale.
-
Demonstrate strong execution by turning strategy into action, driving complex initiatives end to end, and consistently delivering high-quality platform outcomes.
-
Bring deep experience operating Kubernetes and other orchestration systems in large-scale production environments.
-
Deep experience with cloud technologies that support an ML platform, such as AWS, Google Cloud Storage, and infrastructure-as-code tools like Terraform.
-
Proficiency with the programming languages and frameworks commonly used in ML, such as Go and Python.
-
Excellent communication skills with the ability to articulate technical AI concepts to non-technical stakeholders.
-
Strong focus on scalability, reliability, performance, and developer experience.
-
Strong knowledge of model serving, inference pipelines, monitoring, and observability for AI systems is a plus.
Benefits
-
Comprehensive Healthcare Benefits and Income Replacement Programs.
-
401k with Employer Match.
-
Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support.
-
Family Planning Support.
-
Gender-Affirming Care.
-
Mental Health & Coaching Benefits.
-
Flexible Vacation & Paid Volunteer Time Off.
-
Generous Paid Parental Leave.