Research Scientist, Benchmarks & Evaluations @Protege

Research

Salary unspecified	Remote Location 🇺🇸 USA Only
Employment Type full-time	Posted 3d ago

Role Description

Benchmarks decide what AI gets built. Today, most evals don’t measure what we actually care about: they’re contaminated, gameable, or synthetic, or they measure capabilities that don’t transfer to the real tasks frontier models are deployed against. We’re hiring a Research Scientist to lead the design of benchmarks and evaluations that frontier labs, enterprises, and policymakers can actually trust.

You’ll own the science of evaluation across DataLab — designing tasks that meaningfully separate models, validating those tasks against human baselines, and pressure-testing them for contamination, elicitation gaps, and statistical noise. You’ll publish, and your work will directly shape the eval datasets Protege delivers to the most ambitious teams in AI.

Design tasks and benchmarks that distinguish capability levels across frontier models — including agentic, reasoning-heavy, and domain-specific (healthcare, finance, scientific) settings.
Validate evaluations rigorously: run human baselines, analyze inter-rater reliability, study how elicitation and scaffolding shift results, and quantify what’s signal versus noise.
Develop the “science of evals” at Protege — including item response theory, contamination analysis, predictive validity studies, and statistical frameworks for comparing models with appropriate uncertainty.
Run evaluations on current frontier models, sometimes in collaboration with partners at AI labs, enterprises, and government.
Publish research that establishes Protege as the standard-setter for evaluation data, and contribute to the broader AI community’s understanding of what good evals look like.
Translate findings into product, working closely with the data and engineering teams to turn research into evaluation datasets customers can deploy.
Partner with outsourced annotation vendors. Evaluation data is only as good as the people producing it. A meaningful share of this role is owning the statistical machinery that determines which annotators we trust, on which tasks, and by how much, and translating that into trustworthiness scores Protege’s customers can rely on.

Qualifications

Advanced degree (PhD preferred, or MS/BS plus equivalent industry experience) in a quantitative field — applied econometrics with AI experience, quantitative finance, computer science, engineering, statistics/mathematics or any applied research discipline.
Hands-on experience evaluating LLMs, agents, or other ML systems — including prompting, scaffolding, and fluency with the tooling researchers use to run evals at scale.
Experience with annotator quality and inter-rater reliability — designing labeling protocols, computing agreement statistics, and reasoning about annotator bias and calibration.
Excellent scientific writing and communication — you can synthesize technical findings into narratives that frontier labs, enterprise customers, and policymakers can act on.
A bias toward velocity. You know which pipelines need to be production-grade and which can be scrappy, and you get reliable results fast.

Requirements

Experience with RL evaluation techniques — reward modeling, off-policy evaluation, evals for RLHF/RLAIF or agentic RL pipelines.
Ability to navigate new customer architectures, data systems, and requirements quickly.
Experience with latent-variable models of annotator skill (Dawid-Skene, MACE, IRT-style approaches) or with running large expert-annotator panels in regulated domains.
Track record of published benchmarks or evaluation papers the field has adopted.
