[Hiring] Research Scientist, Benchmarks & Evaluations @Protege
Research
Salary unspecified
Remote Location: 🇺🇸 USA Only
Employment Type full-time
Posted 3d ago

Role Description

Benchmarks decide what AI gets built. Today, most evals don’t measure what we actually care about: they’re contaminated, gameable, or synthetic, or they measure capabilities that don’t transfer to the real tasks frontier models are deployed against. We’re hiring a Research Scientist to lead the design of benchmarks and evaluations that frontier labs, enterprises, and policymakers can actually trust.

You’ll own the science of evaluation across DataLab — designing tasks that meaningfully separate models, validating those tasks against human baselines, and pressure-testing them for contamination, elicitation gaps, and statistical noise. You’ll publish, and your work will directly shape the eval datasets Protege delivers to the most ambitious teams in AI.

  • Design tasks and benchmarks that distinguish capability levels across frontier models — including agentic, reasoning-heavy, and domain-specific (healthcare, finance, scientific) settings.
  • Validate evaluations rigorously: run human baselines, analyze inter-rater reliability, study how elicitation and scaffolding shift results, and quantify what’s signal versus noise.
  • Develop the “science of evals” at Protege — including item response theory, contamination analysis, predictive validity studies, and statistical frameworks for comparing models with appropriate uncertainty.
  • Run evaluations on current frontier models, sometimes in collaboration with partners at AI labs, enterprises, and government.
  • Publish research that establishes Protege as the standard-setter for evaluation data, and contribute to the broader AI community’s understanding of what good evals look like.
  • Translate findings into product, working closely with the data and engineering teams to turn research into evaluation datasets customers can deploy.
  • Partner with outsourced annotation vendors. Evaluation data is only as good as the people producing it, so a meaningful share of this role is owning the statistical machinery that determines which annotators we trust, on which tasks, and by how much, and translating that into trustworthiness scores Protege’s customers can rely on.

Qualifications

  • Advanced degree (PhD preferred, or MS/BS plus equivalent industry experience) in a quantitative field — applied econometrics with AI experience, quantitative finance, computer science, engineering, statistics/mathematics or any applied research discipline.
  • Hands-on experience evaluating LLMs, agents, or other ML systems — including prompting, scaffolding, and fluency with the tooling researchers use to run evals at scale.
  • Experience with annotator quality and inter-rater reliability — designing labeling protocols, computing agreement statistics, and reasoning about annotator bias and calibration.
  • Excellent scientific writing and communication — you can synthesize technical findings into narratives that frontier labs, enterprise customers, and policymakers can act on.
  • A bias toward velocity. You know which pipelines need to be production-grade and which can be scrappy, and you get reliable results fast.

Requirements

  • Experience with RL evaluation techniques — reward modeling, off-policy evaluation, evals for RLHF/RLAIF or agentic RL pipelines.
  • Ability to navigate new customer architectures, data systems, and requirements quickly.
  • Experience with latent-variable models of annotator skill (Dawid-Skene, MACE, IRT-style approaches) or with running large expert-annotator panels in regulated domains.
  • Track record of published benchmarks or evaluation papers the field has adopted.