[Hiring] ML Lead, AI Data Labeling @NewtonX
Artificial Intelligence
Salary usd 180,000 - 2..
Remote Location
🇺🇸 USA Only
Employment Type full-time
Posted 1wk ago

1wk ago - NewtonX is hiring a remote ML Lead, AI Data Labeling. 💸 Salary: USD 180,000 - 260,000 per year 📍 Location: USA

Role Description

AI buyers have changed. From mid-market SaaS companies fine-tuning open-source models to Fortune 500 enterprises building internal AI platforms to frontier AI labs running large-scale evaluations, the question is no longer “is AI useful” but “how do we evaluate whether our AI works?” Every one of these buyers needs structured, expert-grounded evaluation data and domain-specific benchmarks. Almost none of them can build it themselves.

That is the opportunity as ML Lead. Rolling up directly to the VP of Commercial, you are the technical counterpart to ML and product teams across our client base, spanning growth-stage AI companies, enterprise AI platforms, and frontier research labs. You sit in their working sessions, hold your ground on technical specifics (eval design, statistical significance, contamination concerns, inter-annotator reliability), translate what they actually need into concrete operational specs, and partner with our recruiting and ops lead to build the expert pipelines that produce defensible data.

You also build. Beyond bespoke client work, you own the design and development of NewtonX domain benchmarks across high-value verticals (finance, legal, healthcare, and others as we expand). These become both syndicated products and methodological proof points that move us up the client sophistication curve.

And you sell, lightly but meaningfully. You are on client calls. You hear gaps. You spot opportunities other vendors miss. You bring those back, shape them into pitches, and partner with Commercial to expand accounts.

In this role you'll focus on:

  • Client Technical Partnership
    • Serve as the primary technical point of contact for ML, applied science, and product teams at our AI-focused clients across the maturity spectrum, from emerging AI companies to enterprise platforms to frontier labs.
    • Hold your own in technical conversations: eval design, dataset construction, contamination risk, statistical power, inter-annotator agreement, RLHF data quality, agentic evaluation, red-teaming methodology.
    • Translate ambiguous technical requirements into concrete operational specs: target expert profiles, screener trees, task design, annotation rubrics, quality control protocols, statistical sampling plans.
    • Calibrate depth to the audience. A Series B AI startup and a frontier lab need different conversations. You can run both.
  • Domain Benchmark Development
    • Design and build domain benchmarks for NewtonX-owned domains in high-value verticals. Initial targets: finance (markets, accounting, regulatory), legal (contracts, case reasoning, jurisdictional), healthcare (clinical reasoning, diagnostic, regulatory). Additional verticals as the business expands.
    • Architect benchmark structure: task taxonomy, difficulty distribution, expert involvement model, evaluation rubrics, scoring protocols, baseline scoring against frontier models.
    • Recruit and calibrate the domain experts who write, validate, and grade benchmark tasks. Work with our recruiting and ops lead to operationalize at scale.
    • Publish methodology papers, technical reports, and leaderboards that make NewtonX benchmarks the reference standard in their verticals.
  • Operationalization with NewtonX Recruiting and Ops
    • Work directly with our full-time recruiting and operations lead to convert client and benchmark requirements into operational specs: expert profiles, screeners, task interfaces, annotation workflows, QC sampling rates, and fielding timelines.
    • Calibrate the recruiting team on what “good” looks like for each engagement. Run alignment sessions when standards shift.
    • Own the technical feedback loop: when an expert clears screening but their output is unusable, diagnose whether it is a screener problem, a task-design problem, or a calibration problem, and fix it upstream.
    • Define quality control metrics: inter-annotator agreement targets, gold-standard task injection rates, and statistical power thresholds. Hold the team accountable to them.
  • Commercial Partnership and Account Expansion
    • Sit in client calls alongside Commercial leads. Surface technical gaps and unsolved problems that the client has not yet asked us to address.
    • Translate gaps into concrete proposal narratives: scope, methodology, deliverables, defensibility. Hand off to Commercial for pricing and close.
    • Contribute to NewtonX positioning with AI buyers: case studies, technical blog posts, conference presence at applied AI and industry events.
    • Help shape what additional ML and research roles we hire as the AI account book and benchmark program grow.

Qualifications

  • 5 to 8 years of applied ML experience with substantive evaluation, benchmark, or human data work.
  • Working fluency with modern LLM evaluation: benchmark design, contamination handling, statistical significance, eval harness construction, agentic and tool-use evaluation, RLHF and preference data quality, red-team probe design.
  • Strong programming foundation: able to read and reason about an eval harness, write Python comfortably, work with model APIs, and prototype scoring pipelines.
  • Statistical fluency: able to defend a sample size choice or a significance threshold.
  • Demonstrated client-facing presence: able to present technical work to skeptical audiences and defend design choices in real time.
  • Light commercial instinct: able to spot opportunities and shape them into a pitch.
  • Strong written communication: able to write methodology sections, benchmark technical reports, or client proposals that hold up to expert review.

Requirements

  • Direct experience designing or contributing to an LLM benchmark or evaluation system (academic, open-source, or proprietary).
  • Domain depth in one or more of: finance, legal, healthcare, scientific reasoning, or software engineering.
  • Exposure to expert-driven data work, such as RLHF pipelines, preference data collection, expert annotation programs, red-team operations, or evaluation contractor management.
  • Graduate degree in computer science, machine learning, statistics, or a related quantitative field.
  • Publications or open-source contributions in evaluation, benchmarking, or applied ML methodology.

Benefits

  • Opportunity to have an astounding impact, build a brand new business unit from the ground up, and have direct C-level influence at an extremely fast-growing late-stage startup.
  • This foundational role will enable you to progress quickly within NewtonX towards commercial and operational leadership.
  • Excellent medical, dental, and vision insurance.
  • 401k match with immediate vesting.
  • Health savings and flexible spending accounts, and pre-tax commuter benefits.
  • Paid time off: vacation, holidays, sick, and parental leave.
  • A diverse, collaborative, and positive culture where we invest in and celebrate each other's success (happy hours, team projects, and retreats).