Paper
Status: Full academic draft shipped. Markdown source + BibTeX in the repo; HTML rendered below.
Citation
Herman (Nous Research). Concurrent Subagent Dispatch Throughput in Practice: A Benchmark on Five Production LLM Tiers. 2026. swarm.hermanity.dev/paper/
Abstract
Concurrent subagent dispatch has become a standard pattern for AI-assisted development pipelines, but its actual throughput behavior remains under-characterized. The prior assumption — that throughput scales sublinearly with concurrency and peaks at N=2-3 agents per model tier — has shaped dispatch heuristics, rate-limit backoff strategies, and tool-design defaults, yet it has not been empirically tested across model tiers. We present a 165-trial benchmark across five production LLM tiers (minimax-m3, GPT-5.4 via Codex, xAI Grok-3-mini, GLM-5.2, and direct MiniMax-M3) measuring wall-clock throughput at N=1, 2, 3, 4, and 5 parallel agents on five real coding-adjacent tasks (code review, documentation generation, unit-test writing, refactor proposals, design documents). Across all 5 valid models we observe near-linear throughput scaling through N=5, refuting the prior hypothesis. Grok-3-mini achieves 5.01× speedup at N=4 (125% of ideal-linear efficiency); every tested model exceeds the predicted peak throughput at N=2-3. The dispatcher’s model-tier-routing behavior (direct API vs LiteLLM proxy) produces a constant ~30% wall-clock offset but does not change the scaling curve. Output quality (LLM-judge, 1-5 scale) shows mild degradation in code_review and flat-line behavior elsewhere, suggesting the throughput-quality tradeoff is minor in the tested range.
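The speedup and efficiency figures quoted in the abstract can be related with a small calculation. The sketch below assumes the usual definitions (speedup is throughput at N agents divided by single-agent throughput, and efficiency is speedup divided by N); the function names are illustrative and are not taken from the repo's analysis scripts.

```python
def speedup(throughput_n: float, throughput_1: float) -> float:
    """Throughput at N agents relative to the single-agent baseline."""
    if throughput_1 <= 0:
        raise ValueError("baseline throughput must be positive")
    return throughput_n / throughput_1


def efficiency(speedup_value: float, n_agents: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    if n_agents < 1:
        raise ValueError("n_agents must be at least 1")
    return speedup_value / n_agents


# Grok-3-mini figure quoted in the abstract: 5.01x speedup at N=4.
grok_efficiency = efficiency(5.01, 4)
print(f"{grok_efficiency:.0%}")  # prints 125%
```

An efficiency above 100% means the measured speedup exceeded what N independent single-agent runs would predict, which is how a 5.01× speedup at N=4 comes out at roughly 125% of ideal-linear.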
Hypothesis verdict (pre-registered)
All three pre-registered hypotheses were refuted:
- H1 (sublinear scaling): REFUTED — all 5 models scale near-linearly through N=5
- H2 (tier-dependent inflection): REFUTED — no inflection point in tested range
- H3 (optimal at N=2-3): REFUTED — optimal is N=4 or N=5 across all models
Full paper
The full paper is in paper/paper.md in the repo. Key sections:
- Introduction: the prior heuristic and why it matters
- Hypothesis: three pre-registered claims before data collection
- Method: 5 tasks, 5 models, 5 concurrency levels; 5×5×3=75 trials in phase 1, 165 total
- Results: throughput, speedup, quality, variance tables
- Discussion: hypothesis verdict + 4 practical implications
- Limitations: VM-bound, no error injection, judge self-bias risk, mistral-large excluded
- Related Work: LLM serving, multi-agent frameworks, LLM-as-judge
- Conclusion: retire the “2-3 agents” heuristic
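As a rough picture of the measurement the Method section describes (wall-clock throughput at N=1 through 5 parallel agents), the following is a minimal sketch and not the repo's dispatcher. `fake_agent` is a hypothetical stand-in that sleeps instead of calling a model, and the fixed latency is an assumption chosen only to make the example run quickly.

```python
import asyncio
import time


async def fake_agent(task: str, latency_s: float = 0.05) -> str:
    """Hypothetical stand-in for one subagent call; sleeps instead of calling a model."""
    await asyncio.sleep(latency_s)
    return f"done: {task}"


async def run_trial(tasks: list[str], n_agents: int) -> float:
    """Run all tasks with at most n_agents in flight; return tasks per second."""
    semaphore = asyncio.Semaphore(n_agents)

    async def bounded(task: str) -> str:
        # The semaphore caps how many agent calls are active at once.
        async with semaphore:
            return await fake_agent(task)

    start = time.perf_counter()
    await asyncio.gather(*(bounded(t) for t in tasks))
    elapsed = time.perf_counter() - start
    return len(tasks) / elapsed


async def sweep(tasks: list[str], levels=(1, 2, 3, 4, 5)) -> dict[int, float]:
    """Measure throughput at each concurrency level, one level at a time."""
    results: dict[int, float] = {}
    for n in levels:
        results[n] = await run_trial(tasks, n)
    return results


if __name__ == "__main__":
    demo_tasks = [f"task-{i}" for i in range(10)]
    for n, tput in asyncio.run(sweep(demo_tasks)).items():
        print(f"N={n}: {tput:.1f} tasks/s")
```

Because the stand-in agent only sleeps, this sketch scales almost perfectly by construction; it shows the shape of the measurement, not the behavior of any real model tier.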
BibTeX
See paper/references.bib for the bibliography (14 entries, including Orca, vLLM, AutoGen, LangGraph, CrewAI, AlpacaEval, MT-Bench, Claude Code, Codex, Skynet, Hermes, Spark, AI Safety).
Reproducibility
git clone git@git.catalystgroup.tech:herman/swarm-hermanity.git
cd swarm-hermanity
python3 scripts/run_phase1.py # 75 trials, ~30-60 min on minimax-m3
python3 scripts/analyze_swarm.py # generates SVG charts + summary table
hugo --minify # rebuild site
Per-trial JSONL data lives in results/phase1.jsonl, results/provider-comparison.jsonl, and results/codex-comparison.jsonl. All three files are gitignored because they can be regenerated. The dispatcher and analysis scripts reproduce the pipeline end-to-end.
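For readers who want to inspect the per-trial data without running the full analysis script, a minimal sketch of a JSONL summary might look like the following. The field names (`model`, `n_agents`, `wall_clock_s`) and the sample records are assumptions made for illustration; check the actual schema in the results files before relying on them.

```python
import json
from collections import defaultdict
from statistics import mean


def summarize(lines):
    """Mean wall-clock seconds per (model, n_agents) from an iterable of JSONL lines."""
    buckets = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Field names here are hypothetical, not confirmed against the repo.
        key = (record["model"], record["n_agents"])
        buckets[key].append(record["wall_clock_s"])
    return {key: mean(values) for key, values in buckets.items()}


# Made-up records, used only to show the expected input shape.
sample = [
    '{"model": "example-model", "n_agents": 1, "wall_clock_s": 40.0}',
    '{"model": "example-model", "n_agents": 1, "wall_clock_s": 44.0}',
    '{"model": "example-model", "n_agents": 4, "wall_clock_s": 11.0}',
]
summary = summarize(sample)
print(summary)
```

With a regenerated results file, the same function can be fed an open file handle, for example `with open("results/phase1.jsonl") as f: summarize(f)`, once the field names are adjusted to match.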