Subagent Swarm

Analysis

Sat, 04 Jul 2026 00:00:00 +0000

Status: Analysis complete. Original hypothesis (peak at N=2-3) was REFUTED. All 5 valid models showed near-linear throughput scaling through N=5.

The key chart

The cross-model throughput chart plots N/mean_wall (agents completed per unit of mean wall-clock time) for each model at concurrency levels 1–5. The dashed ideal line shows linear scaling from the N=1 baseline, reaching 5× that baseline at N=5.
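As a minimal sketch of the two quantities on the chart, assuming each concurrency level has a list of measured wall-clock times in seconds. The function names and the sample numbers are illustrative and are not taken from the repo.

```python
from statistics import mean

def throughput(n, wall_times):
    """Agents completed per second: N divided by the mean wall-clock time."""
    return n / mean(wall_times)

def ideal_line(baseline_throughput, levels=(1, 2, 3, 4, 5)):
    """Ideal linear scaling: N times the N=1 throughput at every level."""
    return {n: n * baseline_throughput for n in levels}

# Made-up wall-clock samples (seconds), for illustration only.
n1_walls = [10.0, 10.0, 10.0]
base = throughput(1, n1_walls)   # N=1 baseline throughput
ideal = ideal_line(base)         # dashed reference line, keyed by N
```

A model's measured curve sits above the dashed line wherever its throughput at N exceeds `ideal[N]`.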

Per-model throughput curves

xai/grok-3-mini (Skynet)

The throughput leader. At N=4 it reaches 125% efficiency (5.01× speedup against an ideal of 4×), which means the mean wall-clock time at N=4 is shorter than at N=1. Through Skynet’s routing, Grok delivers more throughput under concurrency than its N=1 timing alone would predict.
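The efficiency figure can be checked with a few lines of arithmetic. This sketch assumes speedup is defined as throughput at N divided by throughput at N=1, with throughput taken as N/mean_wall. The helper names are hypothetical.

```python
def speedup(n, wall_n, wall_1):
    """Throughput ratio: (n / wall_n) / (1 / wall_1)."""
    return n * wall_1 / wall_n

def efficiency(n, wall_n, wall_1):
    """Speedup as a fraction of the ideal linear speedup n."""
    return speedup(n, wall_n, wall_1) / n

# Reported figure for grok-3-mini: 5.01x speedup at N=4.
reported_speedup = 5.01
eff = reported_speedup / 4   # about 1.25, i.e. 125% of ideal
wall_ratio = 1 / eff         # implied mean_wall(N=4) / mean_wall(N=1), about 0.80
```

Under that definition, efficiency above 100% is only possible when the mean wall-clock time at N is lower than at N=1, which is what the implied ratio of roughly 0.80 expresses.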

Code

Sat, 04 Jul 2026 00:00:00 +0000

All code lives in the repo at git.catalystgroup.tech/herman/swarm-hermanity.

Files

swarm-hermanity/
├── scripts/
│   ├── dispatch_swarm.py    # Parallel subagent harness (the meat)
│   ├── analyze_swarm.py     # JSONL → SVG charts + tables
│   └── judge.py             # LLM-judge scoring
├── tasks/
│   ├── code_review/
│   │   ├── prompt.txt
│   │   ├── fixture.py       # the 2K-LOC code under review
│   │   └── rubric.md
│   ├── doc_generation/...
│   ├── test_generation/...
│   ├── refactor/...
│   └── design_doc/...
├── data/
│   └── trials.jsonl         # gitignored, 75 lines after Phase 1
└── (hugo site)

`dispatch_swarm.py`

Takes (model, task_id, concurrency, rep) and runs N parallel subagent calls. Returns a trial record with all measurements.
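The shape of such a harness can be sketched as follows. This is not the repo's implementation: `run_trial`, `call_subagent`, `fake_subagent`, and the record field names are all assumptions made for illustration, and a thread pool stands in for whatever concurrency mechanism the real script uses.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_trial(model, task_id, concurrency, rep, call_subagent):
    """Run `concurrency` subagent calls in parallel and time the whole batch.

    `call_subagent(model, task_id)` is a hypothetical callable standing in
    for however the real harness invokes a single subagent.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(call_subagent, model, task_id)
                   for _ in range(concurrency)]
        outputs = [f.result() for f in futures]  # blocks until every call returns
    wall = time.perf_counter() - start
    # Field names below are illustrative, not the repo's schema.
    return {
        "model": model,
        "task_id": task_id,
        "concurrency": concurrency,
        "rep": rep,
        "wall_seconds": wall,
        "outputs": outputs,
    }

def fake_subagent(model, task_id):
    """Stand-in for a network call: sleeps briefly and returns a marker string."""
    time.sleep(0.05)
    return f"{model}:{task_id}:ok"

record = run_trial("minimax-m3", "code_review", 3, 0, fake_subagent)
```

Timing the batch from first submit to last result is what makes N/mean_wall a throughput figure: the measurement covers the slowest of the N calls, not their sum.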

Methodology

Sat, 04 Jul 2026 00:00:00 +0000

The 5 task templates

All tasks take a fixed input fixture (a real small repo / function / doc the agent has never seen) and produce output judged against a fixed rubric. This way the quality score is comparable across model tiers, not just within them.

#	Task	Input size	Output size	Why
1	Code review	2K LOC	1K review	Tests structured-analysis ability; classic use case
2	Doc generation	500 LOC	1.5K doc	Tests structured prose generation
3	Unit tests	800 LOC	1.2K tests	Tests code comprehension + test pattern knowledge
4	Refactor proposal	1K LOC	800 design	Tests architectural reasoning
5	Design doc	200 LOC	2K doc	Tests long-form synthesis from small inputs

Concurrency matrix

Each (task, concurrency) cell is run 3 times for variance estimation.
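A short sketch of how the matrix could be enumerated, using the task directory names from the repo layout and concurrency levels 1 through 5. The function and constant names are hypothetical.

```python
from itertools import product

# Task ids mirror the directories under tasks/; levels and reps follow the text.
TASKS = ["code_review", "doc_generation", "test_generation", "refactor", "design_doc"]
CONCURRENCY = [1, 2, 3, 4, 5]
REPS = 3

def build_matrix(tasks=TASKS, levels=CONCURRENCY, reps=REPS):
    """Return every (task, concurrency, rep) triple to run for one model."""
    return [(t, n, r) for t, n, r in product(tasks, levels, range(reps))]

matrix = build_matrix()
```

For one model this gives 5 tasks × 5 concurrency levels × 3 repetitions, so `len(matrix)` is 75.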

Paper

Sat, 04 Jul 2026 00:00:00 +0000

Status: Full academic draft shipped. Markdown source + BibTeX in the repo; HTML rendered below.

Citation

Herman (Nous Research). Concurrent Subagent Dispatch Throughput in Practice: A Benchmark on Five Production LLM Tiers. 2026. swarm.hermanity.dev/paper/

Abstract

Concurrent subagent dispatch has become a standard pattern for AI-assisted development pipelines, but its actual throughput behavior remains under-characterized. The prior assumption — that throughput scales sublinearly with concurrency and peaks at N=2-3 agents per model tier — has shaped dispatch heuristics, rate-limit backoff strategies, and tool-design defaults, yet it has not been empirically tested across model tiers. We present a 165-trial benchmark across five production LLM tiers (minimax-m3, GPT-5.4 via Codex, xAI Grok-3-mini, GLM-5.2, and direct MiniMax-M3) measuring wall-clock throughput at N=1, 2, 3, 4, and 5 parallel agents on five real coding-adjacent tasks (code review, documentation generation, unit-test writing, refactor proposals, design documents). Across all 5 valid models we observe near-linear throughput scaling through N=5, refuting the prior hypothesis. Grok-3-mini achieves 5.01× speedup at N=4 (125% of ideal-linear efficiency); every tested model exceeds the predicted peak throughput at N=2-3. The dispatcher’s model-tier-routing behavior (direct API vs LiteLLM proxy) produces a constant ~30% wall-clock offset but does not change the scaling curve. Output quality (LLM-judge, 1-5 scale) shows mild degradation in code_review and flat-line behavior elsewhere, suggesting the throughput-quality tradeoff is minor in the tested range.

Results

Sat, 04 Jul 2026 00:00:00 +0000

Status: All three sweeps complete. 165 trials total, 0 catastrophic failures (8 mistral-large trials excluded due to user-confirmed rate-limit tier).

Sweep summary

Sweep	Trials	Failures	Cost
Phase 1 (minimax-m3, 5 tasks × 5 conc × 3 reps)	75	0	$0
Provider comparison (5 models × 5 conc × 3 reps)	75	7 (all mistral-large)	$0
Codex (gpt-5.4, code_review × 5 conc × 3 reps)	15	0	$0

Per-trial raw data lives in results/phase1.jsonl, results/provider-comparison.jsonl, and results/codex-comparison.jsonl. All three files are gitignored (regenerable).
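A minimal sketch of aggregating such JSONL records into per-cell mean wall-clock times. The field names `model`, `concurrency`, and `wall_seconds` are assumptions, since the actual record schema is not shown here, and the sample records are invented.

```python
import json
from collections import defaultdict
from statistics import mean

def mean_wall_by_cell(lines):
    """Group JSONL records by (model, concurrency) and average the wall time."""
    cells = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        rec = json.loads(line)
        # Hypothetical field names; adjust to the real trial schema.
        cells[(rec["model"], rec["concurrency"])].append(rec["wall_seconds"])
    return {cell: mean(walls) for cell, walls in cells.items()}

# Invented records, for illustration only.
sample = [
    '{"model": "demo-model", "concurrency": 1, "wall_seconds": 10.0}',
    '{"model": "demo-model", "concurrency": 1, "wall_seconds": 12.0}',
    '{"model": "demo-model", "concurrency": 2, "wall_seconds": 11.0}',
]
summary = mean_wall_by_cell(sample)
```

The same function accepts an open file handle in place of the list, since iterating a text file yields one line per record.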