Code

All code lives in the repo at git.catalystgroup.tech/herman/swarm-hermanity.

Files

swarm-hermanity/
├── scripts/
│   ├── dispatch_swarm.py     # Parallel subagent harness (the meat)
│   ├── analyze_swarm.py      # JSONL → SVG charts + tables
│   └── judge.py              # LLM-judge scoring
├── tasks/
│   ├── code_review/
│   │   ├── prompt.txt
│   │   ├── fixture.py        # the 2K-LOC code under review
│   │   └── rubric.md
│   ├── doc_generation/...
│   ├── test_generation/...
│   ├── refactor/...
│   └── design_doc/...
├── data/
│   └── trials.jsonl          # gitignored, 75 lines after Phase 1
└── (hugo site)

`dispatch_swarm.py`

Takes (model, task_id, concurrency, rep) and runs N parallel subagent calls, where N is the concurrency value. Returns a trial record with all measurements.

python3 scripts/dispatch_swarm.py \
  --model minimax-m3 \
  --task code_review \
  --concurrency 3 \
  --rep 1 \
  --output data/trials.jsonl

Key features:

Uses concurrent.futures.ThreadPoolExecutor to run workers in parallel (the calls are I/O-bound, so threads overlap despite the GIL)
Each worker is an independent subprocess / HTTP request
Tracks per-agent (tokens, cost, latency) via provider response headers
Records wall-clock time from the first worker spawned to the last worker done
Fail-soft: one worker failing doesn’t kill the trial
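
The features above could be sketched roughly as follows. This is a minimal illustration, not the actual contents of dispatch_swarm.py: `run_trial`, `call_agent`, `fake_agent`, and the record field names are all hypothetical, and the worker is a stub rather than a real provider call.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_trial(call_agent, concurrency):
    """Run `concurrency` workers in parallel and collect per-agent results.

    `call_agent` is a hypothetical callable that takes the worker index and
    returns a dict of measurements (tokens, cost, latency).
    """
    agents = []
    start = time.perf_counter()  # wall-clock starts before the first worker is submitted
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {pool.submit(call_agent, i): i for i in range(concurrency)}
        for future in as_completed(futures):
            idx = futures[future]
            try:
                agents.append({"agent": idx, "ok": True, **future.result()})
            except Exception as exc:  # fail-soft: record the failure, keep the trial
                agents.append({"agent": idx, "ok": False, "error": repr(exc)})
    wall_clock_s = time.perf_counter() - start  # stops after the last worker finishes
    return {
        "concurrency": concurrency,
        "wall_clock_s": wall_clock_s,
        "agents": sorted(agents, key=lambda a: a["agent"]),
    }


def fake_agent(i):
    """Stub worker: agent 1 fails, the others return fixed measurements."""
    if i == 1:
        raise RuntimeError("simulated provider error")
    time.sleep(0.01)
    return {"tokens": 100, "latency_s": 0.01}


record = run_trial(fake_agent, concurrency=3)
print([a["ok"] for a in record["agents"]])
```

In this sketch a failed worker still appears in the trial record with `ok` set to false, so downstream analysis can decide whether to count or discard partially failed trials.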

`analyze_swarm.py`

Reads data/trials.jsonl, computes per-concurrency aggregates, renders SVG charts into static/img/charts/.

python3 scripts/analyze_swarm.py \
  --input data/trials.jsonl \
  --output-dir static/img/charts/

Produces:

throughput-vs-n.svg
speedup-vs-n.svg
cost-vs-n.svg
quality-vs-n.svg
cross-model-comparison.svg
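
As an illustration of the per-concurrency aggregation step, here is a minimal sketch. It is not the real analyze_swarm.py: the field names `concurrency` and `wall_clock_s` are assumed, and speedup is defined here (as an assumption) as the time N sequential single-agent runs would take divided by the mean wall-clock time at concurrency N.

```python
import json
from collections import defaultdict
from statistics import mean


def aggregate_by_concurrency(lines):
    """Group trial records by concurrency; compute mean wall-clock and speedup.

    Assumes hypothetical fields `concurrency` and `wall_clock_s` on every
    JSONL line. Speedup is None when there is no N=1 baseline.
    """
    by_n = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the JSONL file
        trial = json.loads(line)
        by_n[trial["concurrency"]].append(trial["wall_clock_s"])
    means = {n: mean(times) for n, times in sorted(by_n.items())}
    baseline = means.get(1)
    return {
        n: {
            "trials": len(by_n[n]),
            "mean_wall_clock_s": m,
            "speedup": (n * baseline / m) if baseline else None,
        }
        for n, m in means.items()
    }


# Made-up sample records, for illustration only.
sample = [
    '{"concurrency": 1, "wall_clock_s": 10.0}',
    '{"concurrency": 1, "wall_clock_s": 10.0}',
    '{"concurrency": 3, "wall_clock_s": 12.0}',
    '{"concurrency": 3, "wall_clock_s": 18.0}',
]
agg = aggregate_by_concurrency(sample)
print(agg[3]["speedup"])
```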

`judge.py`

Scores each trial with an LLM judge on three dimensions (correctness / completeness / concision). The judge uses the same model tier as the trial.

python3 scripts/judge.py \
  --input data/trials.jsonl \
  --output data/trials-judged.jsonl
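
One way the three-dimension scores might be handled is sketched below. The reply format is an assumption (a JSON object with one numeric score per dimension), and `parse_judge_reply` and the unweighted `overall` mean are hypothetical; judge.py may prompt, parse, and combine scores differently.

```python
import json

DIMENSIONS = ("correctness", "completeness", "concision")


def parse_judge_reply(reply_text):
    """Parse a judge reply into per-dimension scores plus an unweighted mean.

    Assumes (hypothetically) that the judge answers with a JSON object
    holding one numeric score per dimension.
    """
    raw = json.loads(reply_text)
    missing = [d for d in DIMENSIONS if d not in raw]
    if missing:
        raise ValueError(f"judge reply is missing dimensions: {missing}")
    scores = {d: float(raw[d]) for d in DIMENSIONS}
    scores["overall"] = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    return scores


scores = parse_judge_reply('{"correctness": 4, "completeness": 3, "concision": 5}')
print(scores["overall"])
```

Rejecting replies with a missing dimension keeps a malformed judge answer from silently lowering or raising a trial's score.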

Task fixtures

Each task has a self-contained fixture directory with:

The input code/doc/spec (pinned at a specific git SHA for reproducibility)
The expected output shape
A grading rubric
A known-good baseline output (for judge calibration)

Fixtures are designed to be small enough to fit in a single context window but realistic enough that “code review” produces real findings (not “looks good, ship it”).

Reproducing the benchmark

git clone git@git.catalystgroup.tech:herman/swarm-hermanity.git
cd swarm-hermanity

# Run Phase 1 sweep (75 trials, ~30-60 min on minimax-m3)
python3 scripts/run_phase1.py

# Generate charts
python3 scripts/analyze_swarm.py \
  --input data/trials.jsonl \
  --output-dir static/img/charts/

# Re-render the site
hugo --minify

run_phase1.py loops over the full (task × concurrency × rep) matrix with retry on transient failures and writes results to data/trials.jsonl.
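
The matrix loop with retry can be sketched as below. This is an assumption-laden outline, not the real run_phase1.py: `TransientError`, `with_retry`, `run_matrix`, and `flaky_run_one` are hypothetical names, and the matrix values are placeholders rather than the actual Phase 1 settings.

```python
import itertools
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def with_retry(fn, attempts=3, base_delay_s=0.0):
    """Call fn(); on TransientError, retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            time.sleep(base_delay_s * 2 ** attempt)


def run_matrix(tasks, concurrencies, reps, run_one):
    """Run every (task, concurrency, rep) combination once, in order."""
    results = []
    for task, n, rep in itertools.product(tasks, concurrencies, range(1, reps + 1)):
        results.append(with_retry(lambda: run_one(task, n, rep)))
    return results


calls = {"count": 0}


def flaky_run_one(task, n, rep):
    """Stub trial runner that fails once with a transient error, then succeeds."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise TransientError("simulated timeout")
    return {"task": task, "concurrency": n, "rep": rep}


results = run_matrix(["code_review", "refactor"], [1, 3], reps=2, run_one=flaky_run_one)
print(len(results), calls["count"])
```

Only errors marked as transient are retried here; any other exception propagates immediately, which keeps genuine bugs from being masked by the retry loop.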