Role Description
Mindrift connects specialists with project-based AI work for leading tech companies, focused on testing, evaluating, and improving AI systems. This is project work, not permanent employment.
What this opportunity involves:
- Building a dataset to evaluate AI coding agents: how well a model handles real-world developer tasks.
- Creating challenging tasks and evaluation criteria within realistic simulated environments:
  - Build virtual companies following a high-level plan: the codebase, infrastructure, and context (conversations, documentation, tickets) that form a realistic environment with a development history.
  - Assemble and calibrate tasks from intermediate states of the virtual company: craft the prompt, define evaluation criteria, and ensure the task is solvable and the evaluation is fair.
  - Design tasks set in isolated environments that emulate a developer's workstation: a Linux machine with development tools (terminal, CLI), MCP servers (repository, task tracker, messenger, documentation, etc.), and a real web application codebase.
- Write tests that accept all correct solutions and reject incorrect ones: neither too strict (breaking on valid approaches) nor too lenient (passing bad ones). A sketch of this follows the list.
- Iterate on tests with an AI agent, verifying that they catch real problems, don't miss bad solutions, and don't break on good ones.
- Review code written by agents, analyze why an agent failed or succeeded, and design edge cases and adversarial scenarios.
- Iterate based on feedback from expert QA reviewers, who score your work against quality criteria.
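For illustration only, here is a minimal sketch of what such a calibrated test can look like in Python. The task, the `slugify` function, and the `solution` module are all invented for this example; real tasks define their own contracts. The key idea is that the test pins down observable behavior, so any correct implementation passes and broken ones fail.

```python
# Hypothetical sketch of a calibrated test. "solution" and "slugify" are
# invented names; a real task would define its own contract.
import pytest

from solution import slugify  # the agent's submission, internals unknown


@pytest.mark.parametrize(
    ("title", "expected"),
    [
        ("Hello World", "hello-world"),          # the basic case
        ("  Trim   me  ", "trim-me"),            # whitespace is collapsed
        ("Already-slugged", "already-slugged"),  # stable on clean input
    ],
)
def test_known_inputs(title: str, expected: str) -> None:
    assert slugify(title) == expected


def test_output_is_url_safe() -> None:
    # A too-strict test would assert on internals (e.g. that re.sub is used).
    # Instead, check the invariant every valid solution must satisfy.
    result = slugify("Chapter 12: Async & Await!")
    assert result == result.lower()
    assert all(ch.isalnum() or ch == "-" for ch in result)
```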
What this is NOT:
- Not data labeling.
- Not prompt engineering.
- Not writing code from scratch: the agent writes most of the code; you guide and evaluate.
- A significant part of the work is done together with AI: it is very hard to create tasks that challenge frontier models without using frontier models.
Qualifications
- Degree in Computer Science, Software Engineering, or a related field.
- 5+ years in software development, primarily Python (FastAPI, pytest, async/await, subprocess, file operations).
- Background in full-stack development, with experience building React-based interfaces (JavaScript/TypeScript) and robust back-end systems.
- Experience writing tests (functional and integration), not just running them.
- Experience with Docker containers, and familiarity with infrastructure tools (Postgres, Kafka, Redis).
- Understanding of CI/CD (GitHub Actions as a user: triggers, labels, reading results).
- English proficiency: B2.
Requirements
- You don't need to be an expert in every item, but you should be comfortable reading and reasoning about code across the stack; a representative snippet follows below.
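As a rough self-check for that comfort level, here is a small, hypothetical sketch of the kind of code involved: an async FastAPI endpoint and an in-process test for it, assuming fastapi, httpx, and pytest are installed. All names are invented for illustration.

```python
# Hypothetical snippet illustrating the stack (FastAPI + pytest + async).
# The endpoint and all names are invented for illustration.
import asyncio

import httpx
from fastapi import FastAPI

app = FastAPI()


@app.get("/health")
async def health() -> dict[str, str]:
    return {"status": "ok"}


def test_health_endpoint() -> None:
    async def call() -> httpx.Response:
        # Drive the ASGI app in-process; no running server is needed.
        transport = httpx.ASGITransport(app=app)
        async with httpx.AsyncClient(transport=transport, base_url="http://test") as client:
            return await client.get("/health")

    response = asyncio.run(call())
    assert response.status_code == 200
    assert response.json() == {"status": "ok"}
```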
Benefits
- Tasks for this project are estimated to take around 20 hours to complete, depending on complexity. This is an estimate, not a schedule requirement; you choose when and how to work.
- To be accepted, tasks must be submitted by the deadline and must meet the listed acceptance criteria.
- On this project, contributors can earn up to the equivalent of $50 per hour, depending on their level and pace of contribution.
- Compensation varies across projects depending on scope, complexity, and required expertise; other projects on the platform may offer different earning levels.