needle-bench

What it is

needle-bench measures how well AI agents resolve real coordination problems. Not toy benchmarks — actual multi-agent conflict scenarios run in Docker containers.

Each scenario creates a situation agents encounter in production: concurrent file edits, merge conflicts, context recovery after crashes, compressed output parsing. The agent either resolves it or doesn't. Pass/fail. No partial credit.

How scoring works

Each scenario has a test assertion. The agent's work either passes the test or it doesn't. Points equal needles resolved. No subjective grading. No human evaluation. The test is the judge.
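As a concrete illustration, a scenario's test can be a few lines of shell. This is a hypothetical sketch, not a shipped scenario: the file name, the env var, and the conflict-marker check are all assumptions.

```shell
#!/bin/sh
# Hypothetical test.sh sketch. One binary assertion: the file the agent
# was asked to merge must contain no leftover git conflict markers.
target="${TARGET_FILE:-merged.yaml}"   # assumed name; a real scenario sets its own

# Standalone demo setup: in a real run the scenario's Dockerfile creates this file.
[ -f "$target" ] || printf 'retries: 3\n' > "$target"

if grep -q '^<<<<<<<' "$target"; then
    echo "FAIL: unresolved conflict markers in $target"
    exit 1
fi
echo "PASS"
```

Exit status is the verdict: 0 is a resolved needle, anything else is a miss.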

Leaderboard

How to submit

  1. Run the bench suite against your model:
    haystack bench --model your-model --provider your-provider
  2. Results are written to results/ with a full audit trail.
  3. Submit to the leaderboard:
    haystack bench --score --submit

Building custom scenarios

A scenario is a Dockerfile that sets up a conflict situation and a test that verifies resolution:

scenarios/
  your-scenario/
    Dockerfile        # Sets up the conflict
    test.sh           # Pass/fail assertion
    README.md         # What this tests

Scenarios inherit from Dockerfile.base. Keep them minimal. The constraint is the test.
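Under these rules a new scenario is mostly conflict staging. A minimal sketch of what a scenario's setup step might run, say from a RUN line in its Dockerfile (hypothetical paths, file names, and branch names; only Dockerfile.base comes from the repo):

```shell
#!/bin/sh
# Hypothetical setup: stage a merge conflict for the agent to resolve.
set -e
mkdir -p /tmp/work && cd /tmp/work
git init -q .
git config user.email bench@example.com
git config user.name bench

# Base commit, then two branches editing the same line.
echo "retries: 1" > config.yaml
git add config.yaml && git commit -qm init
git checkout -qb feature
echo "retries: 5" > config.yaml
git commit -qam feature
git checkout -q -
echo "retries: 3" > config.yaml
git commit -qam mainline

# Leaves conflict markers in config.yaml; test.sh asserts they are gone.
git merge feature || true
```

The test is then the inverse of the setup: it greps for the markers this script leaves behind.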

Submit scenarios via pull request to github.com/os-tack/find-the-needle.

needle-bench.cc

Proofs of the OS stack.