Role Description
Benchmarks decide what AI gets built. Today, most evals don’t measure what we actually care about: they’re contaminated, gameable, or synthetic, or they measure capabilities that don’t transfer to the real tasks frontier models are deployed against. We’re hiring a Research Scientist to lead the design of benchmarks and evaluations that frontier labs, enterprises, and policymakers can actually trust.
You’ll own the science of evaluation across DataLab — designing tasks that meaningfully separate models, validating those tasks against human baselines, and pressure-testing them for contamination, elicitation gaps, and statistical noise. You’ll publish, and your work will directly shape the eval datasets Protege delivers to the most ambitious teams in AI.
- Design tasks and benchmarks that distinguish capability levels across frontier models, including agentic, reasoning-heavy, and domain-specific (healthcare, finance, scientific) settings.
- Validate evaluations rigorously: run human baselines, analyze inter-rater reliability, study how elicitation and scaffolding shift results, and quantify what’s signal versus noise.
- Develop the “science of evals” at Protege, including item response theory, contamination analysis, predictive validity studies, and statistical frameworks for comparing models with appropriate uncertainty.
- Run evaluations on current frontier models, sometimes in collaboration with partners at AI labs, enterprises, and government.
- Publish research that establishes Protege as the standard-setter for evaluation data, and contribute to the broader AI community’s understanding of what good evals look like.
- Translate findings into product, working closely with the data and engineering teams to turn research into evaluation datasets customers can deploy.
- Partner with outsourced annotation vendors. Evaluation data is only as good as the people producing it. A meaningful share of this role is owning the statistical machinery that determines which annotators we trust, on which tasks, and by how much, and translating that into trustworthiness scores Protege’s customers can rely on.
Qualifications
- Advanced degree (PhD preferred, or MS/BS plus equivalent industry experience) in a quantitative field: applied econometrics with AI experience, quantitative finance, computer science, engineering, statistics/mathematics, or any applied research discipline.
- Hands-on experience evaluating LLMs, agents, or other ML systems, including prompting, scaffolding, and fluency with the tooling researchers use to run evals at scale.
- Experience with annotator quality and inter-rater reliability: designing labeling protocols, computing agreement statistics, and reasoning about annotator bias and calibration.
- Excellent scientific writing and communication: you can synthesize technical findings into narratives that frontier labs, enterprise customers, and policymakers can act on.
- A bias toward velocity. You know which pipelines need to be production-grade and which can be scrappy, and you get reliable results fast.
Requirements
- Experience with RL evaluation techniques: reward modeling, off-policy evaluation, and evals for RLHF/RLAIF or agentic RL pipelines.
- Ability to navigate new customer architectures, data systems, and requirements quickly.
- Experience with latent-variable models of annotator skill (Dawid-Skene, MACE, IRT-style approaches) or with running large expert-annotator panels in regulated domains.
- Track record of published benchmarks or evaluation papers the field has adopted.