Role Description
As a Machine Learning Researcher at Bland, you'll be working on foundational research and development across the core components of our voice stack: speech-to-text, large language models, neural audio codecs, and text-to-speech. Your work will define how our agents understand, reason, and speak in real time at enterprise scale.
- Build and Scale Next-Generation TTS Systems
  - Design and train large-scale text-to-speech models capable of expressive, controllable, human-sounding output.
  - Develop neural audio codec-based TTS architectures for efficient, high-fidelity generation.
  - Improve prosody modeling, question inflection, emotional expression, and multi-speaker robustness.
  - Optimize for real-time, low-latency inference in production.
- Advance Speech-to-Text Modeling
  - Build and fine-tune large-scale ASR systems robust to accents, noise, telephony artifacts, and code-switching.
  - Leverage self-supervised pretraining and large-scale weak supervision.
  - Improve transcription accuracy for real-world enterprise scenarios, including structured extraction and conversational nuance.
- Pioneer Neural Audio Codecs
  - Research and implement neural audio codecs that achieve extreme compression with minimal perceptual loss.
  - Explore discrete and continuous latent representations for scalable speech modeling.
  - Design codec architectures that enable downstream generative modeling and controllable synthesis.
- Develop Scalable Training Pipelines
  - Curate and process massive audio datasets across languages, speakers, and environments.
  - Design staged training curricula and data filtering strategies.
  - Scale training across distributed GPU clusters, with a focus on cost, throughput, and reliability.
- Run Rigorous Experiments
  - Design ablation studies that isolate the impact of architectural changes.
  - Measure improvements using both objective metrics and perceptual evaluations.
  - Validate ideas quickly through focused experiments that confirm or eliminate hypotheses.
Qualifications
- Experience with self-supervised learning, multimodal modeling, or generative modeling.
- Hands-on experience building or scaling TTS, STT, or neural audio codec systems.
- Familiarity with large-scale speech datasets and real-world audio variability.
- Experience training and serving large models on modern accelerators.
- Track record of designing controlled experiments and meaningful ablations.
- Comfortable in fast-moving startup environments.
Requirements
- Ability to derive new formulations and implement them efficiently.
- Strong intuition for audio quality, prosody, and conversational dynamics.
- Knowledge of inference optimization techniques, including quantization, kernel optimization, and memory efficiency.
- Understanding of real-time constraints in telephony or streaming environments.
- Ability to move quickly from hypothesis to validation.
- Strong ownership mindset from research through deployment.
- Excited by ambiguous, unsolved problems.
Benefits
- Healthcare, dental, vision, all the good stuff
- Meaningful equity in a fast-growing company
- Every tool you need to succeed
- Beautiful office in Jackson Square, SF with rooftop views
- Competitive salary: $160,000 to $250,000