Human Baseliner for Open-Ended ML Research Tasks

Mercor (client confidential) · Remote

Pay: $75–90/hr
Commitment: hourly
Hours / week: ~40
Source: mercor

About this role

## Overview

We are hiring experienced machine learning engineers and researchers to serve as **human baseliners** for evaluations of open-ended machine learning research tasks. These evaluations measure how well AI agents perform on realistic AI R&D problems. To interpret agent performance, we also need strong human reference points: skilled practitioners attempting the same tasks under the same time and compute constraints.

As a baseliner, you will complete self-contained ML research tasks in a sandboxed environment, working independently with your preferred tools and workflow. Your performance will be used as a benchmark against which frontier-model agents are evaluated.

## What You’ll Do

- Attempt open-ended machine learning research tasks under a fixed time and compute budget (work trial)
- Work independently in a sandboxed Linux environment with internet access
- Use your preferred tooling, including IDEs and AI coding assistants such as Cursor, Claude Code, and ChatGPT
- Record your full working session via screen recording
- Complete a short pre-task and post-task questionnaire
- Submit your final work product, screen recording, and completed questionnaires

After the work trial, you will be hired for a longer commitment.

## Commitment

- Minimum **20 hours per week if selected**
- More availability is strongly preferred

## Requirements

Candidates must meet **all** of the following:

- **3+ years of machine learning experience**
  - Time spent in a PhD program counts toward this requirement
  - Undergraduate and master’s experience does not count
- Attended a **top-100 university** or worked at **FAANG or a comparable company**
- Experience with at least one major ML framework such as **PyTorch, JAX, or TensorFlow**
- Deep, hands-on expertise in at least one of the focus areas below:
  - Pretraining under tight data and compute budgets
  - PPO, reward shaping, custom `gym` / `gymnasium` environments, and throughput tuning
  - Full fine-tuning, LoRA, QLoRA, DPO, RLHF, RLAIF, and distillation
  - Large-scale corpus filtering, deduplication, subsampling, and benchmark contamination avoidance
  - Architecture design under strict parameter-count or size constraints
  - Modifying pretrained architectures, including attention patterns, pooling heads, or training objectives
  - Contrastive training for embedding or retrieval models
  - Generative vision or video modeling
  - Multilingual or low-resource language experience
  - Image or video data pipelines at scale
  - Experience balancing competing model objectives such as safety and capability
  - Prior work as an ML evaluator, red-teamer, or baseliner

## Required Domain Expertise

Candidates must have strong practical experience in **at least one** of the following:

- **Pretraining**: training transformer language models from scratch
- **Reinforcement learning**: training agents in custom or existing environments
- **Post-training**: fine-tuning and aligning LLMs
- **Dataset curation**: building and cleaning large text corpora for LLM training
- **Model architecture**: designing and modifying neural network architectures

## Logistics (work trial requirements)

- One baseline attempt per contractor per task; a contractor may not attempt the same task twice
- All work is confidential and covered by NDA
- Compute and environment are provided; no personal GPU is required

Skills & domains

ai-training
rlhf
sme
annotation
Data Analysis