Paper

Status: Full academic draft shipped. Markdown source + BibTeX in the repo; HTML rendered below.

Citation

Herman (Nous Research). Concurrent Subagent Dispatch Throughput in Practice: A Benchmark on Five Production LLM Tiers. 2026. swarm.hermanity.dev/paper/

Abstract

Concurrent subagent dispatch has become a standard pattern in AI-assisted development pipelines, but its actual throughput behavior remains under-characterized. The prior assumption, that throughput scales sublinearly with concurrency and peaks at N=2-3 agents per model tier, has shaped dispatch heuristics, rate-limit backoff strategies, and tool-design defaults, yet it has not been tested empirically across model tiers. We present a 165-trial benchmark across five production LLM tiers (minimax-m3, GPT-5.4 via Codex, xAI Grok-3-mini, GLM-5.2, and direct MiniMax-M3) that measures wall-clock throughput at N=1, 2, 3, 4, and 5 parallel agents on five real coding-adjacent tasks (code review, documentation generation, unit-test writing, refactor proposals, and design documents). Across all five valid models we observe near-linear throughput scaling through N=5, refuting the prior hypothesis. Grok-3-mini achieves a 5.01× speedup at N=4 (125% of ideal linear scaling), and every tested model reaches higher throughput beyond N=3 than at the predicted N=2-3 peak. The dispatcher’s model-tier routing (direct API vs. LiteLLM proxy) produces a constant ~30% wall-clock offset but does not change the scaling curve. Output quality (LLM judge, 1-5 scale) degrades mildly on code_review and stays flat on the other tasks, suggesting that the throughput-quality tradeoff is minor in the tested range.
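The speedup and efficiency figures in the abstract can be related with a few lines of arithmetic. The sketch below assumes the conventional definitions (throughput as tasks completed per wall-clock second, speedup as throughput at N divided by throughput at N=1, efficiency as speedup divided by N); the paper's exact definitions may differ. The function names and the wall-clock times are hypothetical, chosen only for illustration.

```python
import math


def throughput(tasks_completed: int, wall_clock_s: float) -> float:
    """Tasks completed per second of wall-clock time."""
    return tasks_completed / wall_clock_s


def speedup(throughput_n: float, throughput_1: float) -> float:
    """Throughput at N agents relative to the N=1 baseline."""
    return throughput_n / throughput_1


def efficiency(speedup_n: float, n_agents: int) -> float:
    """Fraction of ideal linear scaling; 1.0 means speedup == N."""
    return speedup_n / n_agents


# Reported figure from the abstract: 5.01x speedup at N=4.
print(f"{efficiency(5.01, 4):.0%}")  # prints 125%

# Hypothetical wall-clock times in seconds, for illustration only.
t1, t4 = 60.0, 48.0
s_direct = speedup(throughput(4, t4), throughput(1, t1))

# A constant multiplicative offset on every measurement (1.3x here,
# an assumed stand-in for the ~30% routing offset) cancels in the ratio.
offset = 1.3
s_proxy = speedup(throughput(4, t4 * offset), throughput(1, t1 * offset))
print(math.isclose(s_direct, s_proxy))  # prints True
```

Under these definitions, a uniform wall-clock offset rescales every throughput value by the same factor, which is one way to see why such an offset would leave the scaling curve unchanged.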

Hypothesis verdict (pre-registered)

All three pre-registered hypotheses were refuted:

H1 (sublinear scaling): REFUTED — all 5 models scale near-linearly through N=5
H2 (tier-dependent inflection): REFUTED — no inflection point in tested range
H3 (optimal at N=2-3): REFUTED — optimal is N=4 or N=5 across all models

Full paper

The full paper is in paper/paper.md in the repo. Key sections:

Introduction: the prior heuristic and why it matters
Hypothesis: three claims pre-registered before data collection
Method: 5 tasks, 5 models, 5 concurrency levels (N=1-5); 75 trials in Phase 1 (5 tasks × 5 levels × 3 repetitions), 165 total
Results: throughput, speedup, quality, and variance tables
Discussion: hypothesis verdict plus 4 practical implications
Limitations: VM-bound, no error injection, judge self-bias risk, mistral-large excluded
Related Work: LLM serving, multi-agent frameworks, LLM-as-judge
Conclusion: retire the “2-3 agents” heuristic
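As a rough illustration of the Method design, the 75-trial Phase 1 run is consistent with a full grid of 5 tasks × 5 concurrency levels × 3 repetitions. That decomposition is an inference from the counts, not something the page states outright, and the task identifiers other than code_review are hypothetical names.

```python
from itertools import product

# Task identifiers: only "code_review" appears on this page; the others
# are hypothetical names for the tasks listed in the abstract.
TASKS = [
    "code_review",
    "documentation",
    "unit_tests",
    "refactor_proposal",
    "design_doc",
]
CONCURRENCY_LEVELS = [1, 2, 3, 4, 5]  # N parallel agents
REPETITIONS = 3  # assumed: 75 / (5 tasks * 5 levels)

# One record per (task, N, repetition) combination.
trials = [
    {"task": task, "n_agents": n, "rep": rep}
    for task, n, rep in product(TASKS, CONCURRENCY_LEVELS, range(REPETITIONS))
]

print(len(trials))  # prints 75
```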

BibTeX

See paper/references.bib for the bibliography (14 entries, including Orca, vLLM, AutoGen, LangGraph, CrewAI, AlpacaEval, MT-Bench, Claude Code, Codex, Skynet, Hermes, Spark, and AI Safety).

Reproducibility

git clone git@git.catalystgroup.tech:herman/swarm-hermanity.git
cd swarm-hermanity
python3 scripts/run_phase1.py    # 75 trials, ~30-60 min on minimax-m3
python3 scripts/analyze_swarm.py  # generates SVG charts + summary table
hugo --minify                    # rebuild site

Per-trial JSONL data lives in results/phase1.jsonl, results/provider-comparison.jsonl, and results/codex-comparison.jsonl. All three files are gitignored because they are regenerable. The dispatcher and analysis scripts are fully reproducible end-to-end.
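A minimal sketch of how per-trial JSONL records might be aggregated into a speedup table follows. It is not the repo's analyze_swarm.py: the field names (model, n_agents, wall_clock_s) are assumptions about the schema, and it further assumes that a trial at N agents completes N tasks.

```python
import json
from collections import defaultdict
from statistics import median
from typing import Iterable


def median_wall_clock(lines: Iterable[str]) -> dict[tuple[str, int], float]:
    """Median wall-clock seconds per (model, n_agents) group."""
    groups: dict[tuple[str, int], list[float]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rec = json.loads(line)
        # Hypothetical field names; adjust to the real schema.
        groups[(rec["model"], rec["n_agents"])].append(rec["wall_clock_s"])
    return {key: median(vals) for key, vals in groups.items()}


def throughput_speedup(
    medians: dict[tuple[str, int], float], model: str
) -> dict[int, float]:
    """Speedup vs. N=1, assuming a trial at N agents completes N tasks."""
    base = medians[(model, 1)]
    return {
        n: n * base / t
        for (m, n), t in sorted(medians.items())
        if m == model
    }


# In-memory sample with made-up numbers; for a real file, pass the open
# file object instead, e.g. open("results/phase1.jsonl").
sample = [
    '{"model": "demo-model", "n_agents": 1, "wall_clock_s": 60}',
    '{"model": "demo-model", "n_agents": 1, "wall_clock_s": 62}',
    '{"model": "demo-model", "n_agents": 1, "wall_clock_s": 58}',
    '{"model": "demo-model", "n_agents": 4, "wall_clock_s": 50}',
    '{"model": "demo-model", "n_agents": 4, "wall_clock_s": 48}',
    '{"model": "demo-model", "n_agents": 4, "wall_clock_s": 46}',
]
result = throughput_speedup(median_wall_clock(sample), "demo-model")
print(result)  # prints {1: 1.0, 4: 5.0}
```

Using the median rather than the mean is a design choice made here to damp outlier trials; the actual analysis script may summarize differently.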