Role Description
Deepgram's infrastructure spans bare metal GPU clusters, multi-cloud deployments, and global edge presence -- all serving real-time voice AI at massive scale while simultaneously powering large-scale model training. As a Systems Architect, you will own the end-to-end infrastructure architecture that makes this possible. You will:
- Define and drive the end-to-end infrastructure architecture for Deepgram's AI/ML workloads across production inference and research training.
- Design multi-cloud and hybrid infrastructure strategies that balance performance, reliability, cost, and vendor flexibility.
- Architect compute orchestration systems that efficiently schedule and manage GPU and CPU workloads across heterogeneous infrastructure.
- Design storage architectures that handle the massive datasets required for speech and audio ML -- from high-throughput training data pipelines to low-latency model serving.
- Lead capacity planning across all infrastructure dimensions, modeling growth and ensuring Deepgram can scale ahead of demand.
- Drive cost optimization and FinOps practices, identifying opportunities to reduce infrastructure spend without compromising performance or reliability.
- Design burstable, elastic training infrastructure that can scale up for large training runs and scale down to minimize idle cost.
- Architect research compute infrastructure that gives ML teams the resources they need while maintaining operational efficiency.
- Establish architectural standards, design review processes, and technical documentation practices for infrastructure decisions.
- Collaborate with engineering leadership to align infrastructure strategy with product roadmap and business objectives.
- Evaluate emerging hardware, cloud services, and infrastructure technologies for potential adoption.
Qualifications
- 7+ years of experience in infrastructure engineering, systems architecture, or a senior technical role focused on large-scale infrastructure.
- Proven experience designing multi-cloud architectures spanning AWS and at least one other major cloud provider or on-premises environment.
- Deep expertise in storage system design -- block, object, and file storage, including performance tuning for large-scale data workloads.
- Strong experience with compute orchestration using Kubernetes, and an understanding of how to schedule diverse workloads efficiently.
- Hands-on experience with GPU infrastructure -- procurement considerations, cluster design, driver and runtime management.
- Track record of capacity planning and infrastructure scaling for high-growth environments.
- Ability to communicate complex architectural decisions clearly to both technical and non-technical stakeholders.
- Strong understanding of networking fundamentals as they relate to infrastructure architecture.
Requirements
- Direct experience architecting infrastructure for ML training workloads -- distributed training, large dataset management, experiment infrastructure.
- Background in cost optimization and FinOps practices for large-scale cloud and bare metal infrastructure.
- Experience operating and managing bare metal infrastructure in colocation facilities.
- Expertise in network architecture design, including high-bandwidth GPU interconnects and global traffic routing.
- Experience with infrastructure modeling and simulation for capacity planning.
- Familiarity with Slurm, Ray, or other HPC/ML job scheduling systems.
- Understanding of power, cooling, and physical infrastructure considerations for GPU-dense deployments.
Benefits
- Holistic health: medical, dental, and vision benefits.
- Annual wellness stipend.
- Mental health support.
- Life, STD, and LTD income insurance plans.
- Unlimited PTO.
- Generous paid parental leave.
- Flexible schedule.
- 12 paid US company holidays.
- Quarterly personal productivity stipend.
- One-time stipend for home office upgrades.
- 401(k) plan with company match.
- Tax savings programs.
- Learning / education stipend.
- Participation in talks and conferences.
- Employee Resource Groups.
- AI enablement workshops and sessions.