Role Description
We’re looking for an infrastructure-focused engineer who thrives at the intersection of machine learning, systems, and product delivery. This is a hands-on role responsible for deploying, monitoring, and scaling a real-time ML-powered content moderation system used to detect and triage abuse, threats, and edge-case language. You’ll work closely with ML engineers, researchers, and clients to build infrastructure that makes high-performance models accessible and reliable in the wild.
Responsibilities
- Design and maintain cloud infrastructure (GCP or AWS) to support real-time model serving, data ingestion, and evaluation workflows.
- Deploy and optimize APIs for low-latency access to ML models and embedding search systems.
- Manage and optimize the end-to-end training data flow, from sourcing and cleaning datasets to preparing them for model consumption, while ensuring accuracy, scalability, and efficiency.
- Build observability tooling for production ML pipelines (monitor latency, error rates, request volumes, drift).
- Automate model deployment, retraining, and evaluation pipelines (CI/CD for ML).
- Work with ML engineers to package models for serving.
- Help manage vector databases and semantic search infrastructure (e.g., Pinecone, FAISS, Vertex Matching Engine).
- Ensure security, compliance, and uptime of infrastructure supporting safety-critical systems.
Qualifications
- 3–8 years of experience deploying machine learning systems or high-availability backend systems.
- Experience shipping and maintaining production infrastructure at scale, supporting ML workflows.
- Experience with GCP, AWS, or similar platforms (including managed ML services).
- Proficient in Terraform, Docker, Kubernetes, or similar infra tools.
- Understanding of performance tradeoffs in serving models and embedding search pipelines.
- Ability to work cross-functionally with ML, security, and product teams to deploy safely and iterate fast.
- Builder's mindset and bias for ownership in ambiguous environments.
Requirements
- Experience with vector databases or ANN systems, preferably within GCP (or AWS).
- Experience serving LLMs or embedding-based models via API.
- Experience with model monitoring, logging, and metrics platforms (e.g., Prometheus, Grafana, Sentry).
- Familiarity with trust & safety infrastructure, abuse detection, or policy enforcement systems.
Benefits
- Salary Range: $130K–$230K, depending on experience and location.
- Performance-based annual bonus.
- Support for continuing education, conferences, or training.
- Fully remote, U.S.-based work environment.
- Comprehensive health, dental, and vision coverage.
- Generous PTO and paid holiday schedule.
- 401(k) plan.