Code
All code lives in the repo at git.catalystgroup.tech/herman/swarm-hermanity.
Files
swarm-hermanity/
├── scripts/
│   ├── dispatch_swarm.py    # Parallel subagent harness (the meat)
│   ├── analyze_swarm.py     # JSONL → SVG charts + tables
│   └── judge.py             # LLM-judge scoring
├── tasks/
│   ├── code_review/
│   │   ├── prompt.txt
│   │   ├── fixture.py       # the 2K-LOC code under review
│   │   └── rubric.md
│   ├── doc_generation/...
│   ├── test_generation/...
│   ├── refactor/...
│   └── design_doc/...
├── data/
│   └── trials.jsonl         # gitignored, 75 lines after Phase 1
└── (hugo site)
dispatch_swarm.py
Takes (model, task_id, concurrency, rep) and runs N parallel subagent calls, where N is the concurrency value. Returns a trial record with all measurements.
python3 scripts/dispatch_swarm.py \
--model minimax-m3 \
--task code_review \
--concurrency 3 \
--rep 1 \
--output data/trials.jsonl
Key features:
- Uses concurrent.futures.ThreadPoolExecutor for true parallelism
- Each worker is an independent subprocess / HTTP request
- Tracks per-agent (tokens, cost, latency) via provider response headers
- Records wall-clock from first worker spawned to last worker done
- Fail-soft: one worker failing doesn’t kill the trial
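The features above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of dispatch_swarm.py: run_trial, fake_worker, and the record field names are hypothetical, and the worker here sleeps instead of making a real provider call.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_trial(worker, n):
    """Run worker(agent_id) n times in parallel and collect results fail-soft."""
    results, errors = [], []
    start = time.perf_counter()  # wall-clock starts just before the first worker is submitted
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = {pool.submit(worker, i): i for i in range(n)}
        for fut in as_completed(futures):
            agent_id = futures[fut]
            try:
                results.append({"agent": agent_id, **fut.result()})
            except Exception as exc:
                # Fail-soft: record the failure and keep the rest of the trial.
                errors.append({"agent": agent_id, "error": repr(exc)})
    # Leaving the with-block waits for every worker, so this is "last worker done".
    wall_clock_s = time.perf_counter() - start
    return {"concurrency": n, "wall_clock_s": wall_clock_s,
            "agents": results, "errors": errors}


def fake_worker(agent_id):
    """Hypothetical stand-in for one subagent call; agent 1 always fails."""
    if agent_id == 1:
        raise RuntimeError("simulated provider error")
    time.sleep(0.05)
    return {"tokens": 100, "latency_s": 0.05}


trial = run_trial(fake_worker, 3)
print(len(trial["agents"]), "succeeded,", len(trial["errors"]), "failed")
```

Threads suit this workload because each worker spends its time waiting on network I/O, so the calls overlap even under the GIL.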
analyze_swarm.py
Reads data/trials.jsonl, computes per-concurrency aggregates, renders SVG charts into static/img/charts/.
python3 scripts/analyze_swarm.py \
--input data/trials.jsonl \
--output-dir static/img/charts/
Produces:
- throughput-vs-n.svg
- speedup-vs-n.svg
- cost-vs-n.svg
- quality-vs-n.svg
- cross-model-comparison.svg
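As a rough illustration of the per-concurrency aggregation step, the sketch below groups JSONL records by concurrency and averages two measurements. The field names (concurrency, wall_clock_s, cost_usd) and the function name are assumptions made for this example; the real trial schema may differ, and chart rendering is left out.

```python
import json
from collections import defaultdict
from statistics import mean


def aggregate_by_concurrency(lines):
    """Group JSONL trial records by concurrency and average their measurements."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines in the file
            record = json.loads(line)
            groups[record["concurrency"]].append(record)
    return {
        n: {
            "trials": len(records),
            "mean_wall_clock_s": mean(r["wall_clock_s"] for r in records),
            "mean_cost_usd": mean(r["cost_usd"] for r in records),
        }
        for n, records in sorted(groups.items())
    }


# Made-up records purely to exercise the function; not benchmark results.
sample = [
    '{"concurrency": 1, "wall_clock_s": 10.0, "cost_usd": 0.02}',
    '{"concurrency": 1, "wall_clock_s": 12.0, "cost_usd": 0.04}',
    '{"concurrency": 3, "wall_clock_s": 15.0, "cost_usd": 0.09}',
]
summary = aggregate_by_concurrency(sample)
for n, stats in summary.items():
    print(n, stats)
```

With a real file, the same function could be fed an open file handle, since iterating a text file yields one line at a time.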
judge.py
Per-trial LLM-judge scoring on three dimensions (correctness / completeness / concision). Uses the same model tier as the trial.
python3 scripts/judge.py \
--input data/trials.jsonl \
--output data/trials-judged.jsonl
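One way the three-dimension scoring could be validated is sketched below. It assumes the judge model replies with a JSON object holding one integer per dimension on a 1 to 5 scale; the scale, the reply format, and the parse_judge_reply helper are all assumptions for illustration and are not taken from judge.py.

```python
import json

DIMENSIONS = ("correctness", "completeness", "concision")


def parse_judge_reply(reply_text, lo=1, hi=5):
    """Validate a judge reply and return per-dimension scores plus their mean."""
    raw = json.loads(reply_text)
    scores = {}
    for dim in DIMENSIONS:
        if dim not in raw:
            raise ValueError(f"judge reply is missing dimension: {dim}")
        value = raw[dim]
        if not lo <= value <= hi:
            raise ValueError(f"{dim} score {value} is outside {lo}..{hi}")
        scores[dim] = value
    scores["mean"] = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    return scores


# Hypothetical judge reply used only to exercise the parser.
judged = parse_judge_reply('{"correctness": 4, "completeness": 5, "concision": 3}')
print(judged)
```

Rejecting malformed or out-of-range replies up front keeps a bad judge response from silently skewing the quality aggregates.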
Task fixtures
Each task has a self-contained fixture directory with:
- The input code/doc/spec (pinned at a specific git SHA for reproducibility)
- The expected output shape
- A grading rubric
- A known-good baseline output (for judge calibration)
Fixtures are designed to be small enough to fit in a single context window but realistic enough that “code review” produces real findings (not “looks good, ship it”).
Reproducing the benchmark
git clone git@git.catalystgroup.tech:herman/swarm-hermanity.git
cd swarm-hermanity
# Run Phase 1 sweep (75 trials, ~30-60 min on minimax-m3)
python3 scripts/run_phase1.py
# Generate charts
python3 scripts/analyze_swarm.py
# Re-render the site
hugo --minify
run_phase1.py loops over the full (task × concurrency × rep) matrix with retry on transient failures and writes results to data/trials.jsonl.
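The matrix loop with retry can be pictured along these lines. The task names come from the tasks/ directory listed earlier, but the concurrency levels, rep count, TransientError class, and backoff policy are hypothetical values chosen so the example yields 75 trials; the actual settings in run_phase1.py are not shown here.

```python
import itertools
import time

TASKS = ("code_review", "doc_generation", "test_generation", "refactor", "design_doc")
CONCURRENCY_LEVELS = (1, 2, 4, 8, 16)  # hypothetical levels
REPS = (1, 2, 3)                       # hypothetical rep count


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def with_retry(fn, attempts=3, base_delay_s=0.01):
    """Call fn(), retrying on TransientError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the failure
            time.sleep(base_delay_s * 2 ** (attempt - 1))


def run_matrix(run_one):
    """Run every (task, concurrency, rep) combination once, in order."""
    records = []
    for task, n, rep in itertools.product(TASKS, CONCURRENCY_LEVELS, REPS):
        records.append(with_retry(lambda: run_one(task, n, rep)))
    return records


calls = {"count": 0}


def flaky_run_one(task, n, rep):
    """Stand-in for one trial; the very first call fails transiently."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise TransientError("simulated timeout")
    return {"task": task, "concurrency": n, "rep": rep}


records = run_matrix(flaky_run_one)
print(len(records), "trials,", calls["count"], "calls")
```

Only errors classed as transient are retried here; any other exception propagates immediately so real bugs are not masked by the retry loop.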