needle-bench
What it is
needle-bench measures how well AI agents resolve real coordination problems. Not toy benchmarks — actual multi-agent conflict scenarios run in Docker containers.
Each scenario creates a situation agents encounter in production: concurrent file edits, merge conflicts, context recovery after crashes, compressed output parsing. The agent either resolves it or doesn't. Pass/fail. No partial credit.
Current state
- 23 Docker scenarios
- 26 scores submitted
- 2 models tested (Gemini 2.0 Flash: 70% resolve rate)
- Base image: 74MB Alpine, verified across all scenarios
- Runner supports Gemini, Anthropic, and OpenRouter APIs
How scoring works
Each scenario has a test assertion. The agent's work either passes the test or it doesn't. Points equal needles resolved. No subjective grading. No human evaluation. The test is the judge.
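A pass/fail assertion can be as small as a shell exit code. A hypothetical sketch of what a scenario's test might look like (the filename and conflict-marker check are assumptions, not the actual suite):

```shell
#!/bin/sh
# Hypothetical test.sh sketch: pass iff the agent left no merge-conflict
# markers in the target file. Path and contents are illustrative only.
target="resolved.txt"
printf 'line one\nline two\n' > "$target"   # stand-in for the agent's output

if grep -qE '^(<{7}|={7}|>{7})' "$target"; then
  echo "FAIL: conflict markers remain"
  exit 1
fi
echo "PASS"
```

The exit code is the verdict: zero is a resolved needle, anything else is a miss.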
Leaderboard
How to submit
- Run the bench suite against your model: `haystack bench --model your-model --provider your-provider`
- Results are written to `results/` with a full audit trail.
- Submit to the leaderboard: `haystack bench --score --submit`
Building custom scenarios
A scenario is a Dockerfile that sets up a conflict situation and a test that verifies resolution:
    scenarios/
      your-scenario/
        Dockerfile   # Sets up the conflict
        test.sh      # Pass/fail assertion
        README.md    # What this tests

Scenarios inherit from Dockerfile.base. Keep them minimal. The constraint is the test.
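One way to scaffold that layout is a short shell script. Everything here is illustrative: the scenario name, the base image tag, and the setup step are assumptions, not the project's actual conventions:

```shell
#!/bin/sh
# Scaffold a new scenario directory (all names are hypothetical).
name="my-scenario"
mkdir -p "scenarios/$name"

cat > "scenarios/$name/Dockerfile" <<'EOF'
# Assumed base image name; check Dockerfile.base for the real tag.
FROM needle-bench/base
COPY setup.sh /setup.sh
RUN /setup.sh   # create the conflict the agent must resolve
EOF

cat > "scenarios/$name/test.sh" <<'EOF'
#!/bin/sh
# Exit 0 when the conflict is resolved, non-zero otherwise.
exit 1
EOF
chmod +x "scenarios/$name/test.sh"

echo "created scenarios/$name"
```

The placeholder `exit 1` makes a fresh scenario fail by default; the test only passes once it genuinely checks for resolution.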
Submit scenarios via pull request to github.com/os-tack/find-the-needle.
needle-bench.cc
Proofs of the OS stack.