Code

All code lives in the repo at git.catalystgroup.tech/herman/swarm-hermanity.

Files

swarm-hermanity/
├── scripts/
│   ├── dispatch_swarm.py     # Parallel subagent harness (the meat)
│   ├── analyze_swarm.py      # JSONL → SVG charts + tables
│   ├── judge.py              # LLM-judge scoring
│   └── run_phase1.py         # Phase 1 sweep driver
├── tasks/
│   ├── code_review/
│   │   ├── prompt.txt
│   │   ├── fixture.py        # the 2K-LOC code under review
│   │   └── rubric.md
│   ├── doc_generation/...
│   ├── test_generation/...
│   ├── refactor/...
│   └── design_doc/...
├── data/
│   └── trials.jsonl          # gitignored, 75 lines after Phase 1
└── (hugo site)

dispatch_swarm.py

Takes (model, task_id, concurrency, rep), runs N parallel subagent calls (N is the --concurrency value), and writes one trial record with all measurements to the --output file.

python3 scripts/dispatch_swarm.py \
  --model minimax-m3 \
  --task code_review \
  --concurrency 3 \
  --rep 1 \
  --output data/trials.jsonl

Key features:

  • Uses concurrent.futures.ThreadPoolExecutor to run workers concurrently (the calls are I/O-bound, so threads overlap despite the GIL)
  • Each worker is an independent subprocess / HTTP request
  • Tracks per-agent (tokens, cost, latency) via provider response headers
  • Records wall-clock time from the first worker spawn to the last worker completion
  • Fail-soft: one worker failing doesn’t kill the trial
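
The features above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of dispatch_swarm.py: the names run_trial and fake_worker, the record keys, and the simulated failure are all assumptions made for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_trial(worker, concurrency):
    """Run `concurrency` copies of `worker` in threads; tolerate individual failures."""
    results = []
    start = time.perf_counter()  # clock starts before the first worker is submitted
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {pool.submit(worker, i): i for i in range(concurrency)}
        for future in as_completed(futures):
            agent_id = futures[future]
            try:
                results.append({"agent": agent_id, "ok": True, **future.result()})
            except Exception as exc:  # fail-soft: record the error, keep the trial alive
                results.append({"agent": agent_id, "ok": False, "error": repr(exc)})
    wall_clock_s = time.perf_counter() - start  # stops after the last worker finishes
    results.sort(key=lambda r: r["agent"])
    return {"concurrency": concurrency, "wall_clock_s": wall_clock_s, "agents": results}


def fake_worker(agent_id):
    """Hypothetical stand-in for a real subagent call; agent 1 always fails."""
    time.sleep(0.05)
    if agent_id == 1:
        raise RuntimeError("simulated provider error")
    return {"tokens": 100 + agent_id, "latency_s": 0.05}
```

Calling run_trial(fake_worker, 3) returns a record with three agent entries, one of them marked ok=False, which is the fail-soft behaviour described in the list.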

analyze_swarm.py

Reads data/trials.jsonl, computes per-concurrency aggregates, renders SVG charts into static/img/charts/.

python3 scripts/analyze_swarm.py \
  --input data/trials.jsonl \
  --output-dir static/img/charts/

Produces:

  • throughput-vs-n.svg
  • speedup-vs-n.svg
  • cost-vs-n.svg
  • quality-vs-n.svg
  • cross-model-comparison.svg
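
The per-concurrency aggregation step might look something like the sketch below. The field names concurrency and wall_clock_s, and the throughput definition (agents completed per second of mean wall-clock), are assumptions for illustration; the real script's schema and metrics may differ.

```python
import json
from collections import defaultdict
from statistics import mean


def aggregate(lines):
    """Group JSONL trial records by concurrency and average their wall-clock time."""
    by_n = defaultdict(list)
    for line in lines:
        if line.strip():  # skip blank lines
            record = json.loads(line)
            by_n[record["concurrency"]].append(record["wall_clock_s"])
    summary = {}
    for n, walls in sorted(by_n.items()):
        mean_wall = mean(walls)
        summary[n] = {
            "trials": len(walls),
            "mean_wall_clock_s": mean_wall,
            # assumed definition: n agents finish per trial, so n / mean wall-clock
            "throughput_per_s": n / mean_wall,
        }
    return summary
```

Passing an open file handle works directly, for example aggregate(open("data/trials.jsonl")), since the function only iterates over lines.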

judge.py

Per-trial LLM-judge scoring on three dimensions (correctness / completeness / concision). Uses the same model tier as the trial.

python3 scripts/judge.py \
  --input data/trials.jsonl \
  --output data/trials-judged.jsonl
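
As an illustration of three-dimension scoring, the sketch below parses a judge reply into per-dimension scores. The JSON reply shape, the 1 to 5 scale, and the unweighted mean are assumptions; none of them is confirmed as judge.py's actual format.

```python
import json

DIMENSIONS = ("correctness", "completeness", "concision")


def parse_judge_reply(reply, lo=1, hi=5):
    """Parse a JSON judge reply into per-dimension scores plus an unweighted mean.

    The reply shape and the lo..hi scale are assumed for this example.
    """
    data = json.loads(reply)
    scores = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not lo <= value <= hi:
            raise ValueError(f"missing or out-of-range score for {dim!r}: {value!r}")
        scores[dim] = value
    scores["mean"] = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    return scores
```

Rejecting missing or out-of-range values up front keeps a malformed judge reply from silently skewing the quality aggregates.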

Task fixtures

Each task has a self-contained fixture directory with:

  • The input code/doc/spec (pinned at a specific git SHA for reproducibility)
  • The expected output shape
  • A grading rubric
  • A known-good baseline output (for judge calibration)

Fixtures are designed to be small enough to fit in a single context window but realistic enough that “code review” produces real findings (not “looks good, ship it”).

Reproducing the benchmark

git clone git@git.catalystgroup.tech:herman/swarm-hermanity.git
cd swarm-hermanity

# Run Phase 1 sweep (75 trials, ~30-60 min on minimax-m3)
python3 scripts/run_phase1.py

# Generate charts
python3 scripts/analyze_swarm.py

# Re-render the site
hugo --minify

run_phase1.py loops over the full (task × concurrency × rep) matrix with retry on transient failures and writes results to data/trials.jsonl.
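
A sweep loop of this kind could be structured as in the sketch below. The TransientError class, the with_retry and run_matrix helpers, and the exponential backoff policy are hypothetical; the actual grid values and retry policy are whatever run_phase1.py defines.

```python
import itertools
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def with_retry(fn, attempts=3, base_delay_s=1.0, sleep=time.sleep):
    """Call fn(); on TransientError retry with exponential backoff.

    The error from the final attempt is re-raised to the caller.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay_s * 2 ** attempt)


def run_matrix(tasks, concurrencies, reps, run_one, **retry_kwargs):
    """Run run_one(task, n, rep) for every cell of the task x concurrency x rep matrix."""
    records = []
    for task, n, rep in itertools.product(tasks, concurrencies, range(1, reps + 1)):
        # the lambda is invoked immediately, so it sees this iteration's values
        records.append(with_retry(lambda: run_one(task, n, rep), **retry_kwargs))
    return records
```

Only errors explicitly classed as transient are retried here, so a genuine bug in a task still surfaces immediately instead of being retried.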