needle-bench
What it is
needle-bench measures how well AI agents resolve real coordination problems. Not toy benchmarks — actual multi-agent conflict scenarios run in Docker containers.
Each scenario creates a situation agents encounter in production: concurrent file edits, merge conflicts, context recovery after crashes, compressed output parsing. The agent either resolves it or doesn't. Pass/fail. No partial credit.
Current state
- 23 Docker scenarios
- 26 scores submitted
- 2 models tested (Gemini 2.0 Flash: 70% resolve rate)
- Base image: 74MB Alpine, verified across all scenarios
- Runner supports Gemini, Anthropic, and OpenRouter APIs
How scoring works
Each scenario has a test assertion. The agent's work either passes the test or it doesn't. Points equal needles resolved. No subjective grading. No human evaluation. The test is the judge.
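A pass/fail assertion can be as small as a shell exit code. A hypothetical sketch of what a scenario's test might look like (the filename and conflict-marker check are assumptions, not the actual suite):

```shell
#!/bin/sh
# Hypothetical test.sh sketch: pass iff the agent left no merge-conflict
# markers in the target file. Path and contents are illustrative only.
target="resolved.txt"
printf 'line one\nline two\n' > "$target"   # stand-in for the agent's output

if grep -qE '^(<{7}|={7}|>{7})' "$target"; then
  echo "FAIL: conflict markers remain"
  exit 1
fi
echo "PASS"
```

The exit code is the verdict: zero is a resolved needle, anything else is a miss.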
Leaderboard
How to submit
- Run the bench suite against your model: `haystack bench --model your-model --provider your-provider`
- Results are written to `results/` with a full audit trail.
- Submit to the leaderboard: `haystack bench --score --submit`
Building custom scenarios
A scenario is a Dockerfile that sets up a conflict situation and a test that verifies resolution:
    scenarios/
      your-scenario/
        Dockerfile   # Sets up the conflict
        test.sh      # Pass/fail assertion
        README.md    # What this tests

Scenarios inherit from Dockerfile.base. Keep them minimal. The constraint is the test.
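One way to scaffold that layout is a short shell script. Everything here is illustrative: the scenario name, the base image tag, and the setup step are assumptions, not the project's actual conventions:

```shell
#!/bin/sh
# Scaffold a new scenario directory (all names are hypothetical).
name="my-scenario"
mkdir -p "scenarios/$name"

cat > "scenarios/$name/Dockerfile" <<'EOF'
# Assumed base image name; check Dockerfile.base for the real tag.
FROM needle-bench/base
COPY setup.sh /setup.sh
RUN /setup.sh   # create the conflict the agent must resolve
EOF

cat > "scenarios/$name/test.sh" <<'EOF'
#!/bin/sh
# Exit 0 when the conflict is resolved, non-zero otherwise.
exit 1
EOF
chmod +x "scenarios/$name/test.sh"

echo "created scenarios/$name"
```

The placeholder `exit 1` makes a fresh scenario fail by default; the test only passes once it genuinely checks for resolution.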
Submit scenarios via pull request to github.com/os-tack/find-the-needle.
needle-bench.cc
Proofs of the OS stack.