Member of Technical Staff, Model Behavior & Evaluation
Build the evaluation systems that measure expert AI authenticity. Turn “this feels right” into quantifiable metrics that prove agent quality and consistency.
Expert AI agents should feel authentic, consistent, and true to their source. But how do you measure that? How do you prove it? How do you capture “vibes”?
You’ll build the evaluation systems that turn “this feels right” into quantifiable metrics. The infrastructure that measures authenticity, tracks behavioral consistency, and catches drift before users notice.
This role defines how we measure what makes our agents unique.
What You’ll Build
- Quantitative behavioral evaluation pipelines — measure consistency, authenticity, and expertise retention across model outputs
- Human evaluation at scale — design workflows that capture what makes expert behavior authentic
- Differentiation metrics — create tools that show creators exactly how their agent performs
- Quality assurance systems — catch behavioral drift, detect edge cases, and maintain agent fidelity over time (a rough sketch of what drift detection could look like follows this list)
- Data feedback loops — turn evaluation insights into training improvements
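To make that concrete, here is one rough sketch of an embedding-based drift check. It is illustrative only: the model name, threshold, and sample outputs are placeholders, not our actual pipeline.

```python
# Rough sketch of an embedding-based drift check (illustrative only).
# Assumes sentence-transformers and numpy are installed; the model name,
# threshold, and sample outputs are placeholders, not our actual pipeline.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def centroid(texts):
    # Mean of normalized embeddings acts as a behavioral "fingerprint".
    emb = model.encode(texts, normalize_embeddings=True)
    return emb.mean(axis=0)

def drift_score(reference_outputs, candidate_outputs):
    # Cosine distance between the reference fingerprint and the candidate one.
    ref, cand = centroid(reference_outputs), centroid(candidate_outputs)
    cosine = float(np.dot(ref, cand) / (np.linalg.norm(ref) * np.linalg.norm(cand)))
    return 1.0 - cosine

# Placeholder data: outputs from the approved agent vs. a new model version.
baseline = ["Here's how I'd think about your portfolio...", "Let's start with the fundamentals."]
candidate = ["Sure!! here u go", "lol great question"]

if drift_score(baseline, candidate) > 0.15:  # threshold is illustrative
    print("Possible behavioral drift - route to human evaluation")
```

The real systems go well beyond a single cosine threshold, but this is the shape of the problem: turn a behavioral hunch into a number you can alert on.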
What We’re Looking For
- Experience building evaluation systems for LLMs, RLHF pipelines, or model behavior analysis
- Fluent in both quantitative methods (embeddings, clustering, statistical significance) and qualitative insight (when numbers lie)
- Can design and run human evaluation at scale — you understand inter-annotator agreement, labeling protocols, and data quality (see the toy example after this list)
- Comfortable writing code (Python, notebooks, eval harnesses) and shipping production instrumentation
- Have worked in applied ML/AI research, model alignment, or conversational AI evaluation
- You default to “measure it” over “theorize about it”
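If you have run annotation projects, agreement checks like this are second nature; for everyone else, a toy example of what we mean (the labels are made up, and this uses scikit-learn purely for illustration):

```python
# Toy inter-annotator agreement check (labels below are made up).
from sklearn.metrics import cohen_kappa_score

annotator_a = ["authentic", "authentic", "off-voice", "authentic", "off-voice"]
annotator_b = ["authentic", "off-voice", "off-voice", "authentic", "off-voice"]

# Cohen's kappa: chance-corrected agreement in [-1, 1]; ~0.62 on this toy data.
print(f"kappa = {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```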
What Sets You Apart
- You’ve built or contributed to datasets for training a language model
- You’ve designed behavioral benchmarks or evals for model fine-tuning (not just accuracy metrics)
- You’ve run large-scale human annotation projects that fed directly into training loops
- You understand the difference between “this model sounds better” and “this model is provably different”
- You can explain technical eval results to non-technical stakeholders (investors, creators) in ways that build confidence
Why This Role Matters
You’ll report directly to the CTO and define how we measure what makes expert AI agents truly expert.
This is the bridge between intuition and proof. Between building agents and trusting them.
How to Apply
Send us a note at [email protected] with:
- Your background in dataset and model evals, or behavioral AI research
- One concrete example of how you’ve measured model behavior or built eval infrastructure
- What you’d measure first if you started Monday