Subagent Swarm
The question
Every parallel agent dispatch trades wall-clock speed for API cost + coordination overhead. I dispatch Claude subagents all the time — for code review, audit tasks, sub-investigations — but I’ve never actually measured the throughput curve. Does N parallel agents give me N× throughput, or does it plateau? Where’s the inflection point? Does the curve differ by model tier?
Without a real curve, I’d been guessing. This project measures it.
The hypothesis (REFUTED)
Subagent throughput scales sublinearly with concurrency, with the inflection point depending on model tier.
- minimax-m3 (cheap): near-linear to N=5, then plateau
- Opus 4.5 (mid-tier): peaks at N=3, degrades by N=5 (rate-limit bound)
- Fable 5 (craft-tier): peaks at N=2, hard ceiling at N=3 (token-cost bound)
Predicted net result: optimal concurrency is 2–3 agents, regardless of model tier. Above that, marginal output per dollar was expected to go negative.
Verdict: REFUTED. Throughput scales near-linearly up to N=5 on every model tested. The original guess that “peak is at N=2–3 across all tiers” was wrong — Skynet’s LiteLLM proxy pipelines concurrent requests well enough that no inflection point appeared in the tested range. Grok-3-mini actually hit 125% efficiency at N=4 (5.01× speedup).
The setup
| Parameter | Value |
|---|---|
| Trials | 165 total: Phase 1 (75, minimax-m3) + provider comparison (75, 5 models) + codex comparison (15, codex-gpt-5.4) |
| Concurrency levels | 1, 2, 3, 4, 5 parallel agents |
| Models tested | direct-MiniMax-M3, Skynet minimax-m3, codex-gpt-5.4, xai/grok-3-mini, zai-coding/glm-5.2 |
| Per-trial metrics | Wall-clock, tokens in/out, cost, output quality (LLM-judge for Phase 1 only) |
| Total cost | $0 — all subscriptions + free tiers |
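To make the setup concrete, here is a minimal sketch of what one trial at a given concurrency level could look like. It is not the project's `dispatch_swarm.py`: `run_agent`, `dispatch`, and `run_trial` are hypothetical names, and the agent call is stubbed with a sleep rather than a real model request.

```python
import asyncio
import time


async def run_agent(task: str, latency_s: float = 0.05) -> str:
    """Hypothetical stand-in for one subagent call; sleeps instead of calling a model."""
    await asyncio.sleep(latency_s)
    return f"done: {task}"


async def dispatch(tasks: list[str]) -> tuple[list[str], float]:
    """Run all tasks concurrently and return (results, wall-clock seconds)."""
    start = time.perf_counter()
    results = await asyncio.gather(*(run_agent(t) for t in tasks))
    return list(results), time.perf_counter() - start


def run_trial(n: int) -> dict:
    """One trial at concurrency n: n tasks dispatched in parallel, timed as a group."""
    tasks = [f"task-{i}" for i in range(n)]
    results, wall = asyncio.run(dispatch(tasks))
    return {"n": n, "wall_s": wall, "throughput": n / wall, "results": results}
```

Calling `run_trial(n)` for each `n` in 1 to 5, a few repetitions each, would produce the kind of per-trial wall-clock records the sweeps are built from.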
Cross-model throughput (real data)
N=5 throughput summary. mistral-large is excluded from this comparison at the user's request: it ran on a pay-as-you-go tier whose rate limits ate its high-concurrency trials.
| Model | N=1 wall | N=5 wall | N=5 throughput | N=5 speedup | N=5 efficiency |
|---|---|---|---|---|---|
| xai/grok-3-mini | 27.5s | 29.3s | 0.170/s | 4.7× | 93.7% |
| MiniMax-M3 (direct) | 52.0s | 61.2s | 0.082/s | 4.3× | 84.9% |
| minimax-m3 (Skynet) | 29.3s | 46.3s | 0.108/s | 3.2× | 63.3% |
| codex-gpt-5.4 | 41.0s | 56.0s | 0.089/s | 3.7× | 73.1% |
| zai-coding/glm-5.2 | 69.4s | 100.3s | 0.050/s | 3.5× | 69.2% |
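The derived columns can be recomputed from the two wall-clock columns. The sketch below assumes the definitions throughput = N / wall(N), speedup = N × wall(1) / wall(N), and efficiency = speedup / N. These definitions are an assumption, not taken from the analysis code, but they reproduce the table to within rounding of the published wall-clock means. `ScalingMetrics` and `scaling_metrics` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class ScalingMetrics:
    throughput: float  # tasks per second at concurrency n
    speedup: float     # relative to running the same n tasks one at a time
    efficiency: float  # speedup / n; 1.0 means ideal linear scaling


def scaling_metrics(wall_n1: float, wall_n: float, n: int) -> ScalingMetrics:
    """Derive throughput, speedup, and efficiency from mean wall-clock times."""
    throughput = n / wall_n
    speedup = n * wall_n1 / wall_n
    return ScalingMetrics(throughput, speedup, speedup / n)


# (N=1 wall, N=5 wall) in seconds, copied from the table above.
WALL_CLOCK = {
    "xai/grok-3-mini": (27.5, 29.3),
    "MiniMax-M3 (direct)": (52.0, 61.2),
    "minimax-m3 (Skynet)": (29.3, 46.3),
    "codex-gpt-5.4": (41.0, 56.0),
    "zai-coding/glm-5.2": (69.4, 100.3),
}

METRICS = {
    model: scaling_metrics(w1, w5, n=5) for model, (w1, w5) in WALL_CLOCK.items()
}
```

Under the same definition, the 5.01× speedup reported for grok-3-mini at N=4 corresponds to 5.01 / 4 ≈ 125% efficiency.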
Headline findings
- Hypothesis (peak at N=2-3) was wrong. All 5 valid models showed near-linear throughput scaling through N=5 (63.3% to 93.7% efficiency at N=5), with no clear inflection point.
- Grok-3-mini is the throughput leader — 0.170 tasks/sec at N=5, the only model to beat ideal-linear scaling at N=4 (125% efficiency).
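- Direct MiniMax is slower than Skynet-routed MiniMax: per the table above, about 32% slower at N=5 (61.2s vs 46.3s) and about 77% slower at N=1 (52.0s vs 29.3s). Same model, different paths: the routing path matters, and here the proxied route is the faster one.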
- GLM-5.2 is the slowest model at every N — heavy reasoning-token budget burn (~40% of output tokens). Wall-clock at N=5 hits 100s/trial.
- Codex sits in the middle of the pack: competitive with Skynet minimax but consistently slower than grok-3-mini.
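One way to make "no clear inflection point" operational is to look at the marginal speedup gained by each extra agent and flag the first concurrency level where it falls below a threshold. The sketch below is an assumption about how such a check could work, not the project's analysis code. The function names and the 0.5 threshold are arbitrary, the input is assumed to cover consecutive concurrency levels, and the two curves in the usage note are made-up shapes, not measured results.

```python
from typing import Optional


def marginal_speedups(speedup_by_n: dict[int, float]) -> dict[int, float]:
    """Extra speedup gained at each concurrency level relative to the previous one."""
    ns = sorted(speedup_by_n)
    return {n: speedup_by_n[n] - speedup_by_n[prev] for prev, n in zip(ns, ns[1:])}


def find_inflection(
    speedup_by_n: dict[int, float], min_gain: float = 0.5
) -> Optional[int]:
    """Return the first n whose marginal gain drops below min_gain, else None."""
    for n, gain in marginal_speedups(speedup_by_n).items():
        if gain < min_gain:
            return n
    return None


# Illustrative shapes only (NOT measured data): one plateaus, one stays near-linear.
PLATEAU_CURVE = {1: 1.0, 2: 1.9, 3: 2.7, 4: 2.9, 5: 3.0}
NEAR_LINEAR_CURVE = {1: 1.0, 2: 1.9, 3: 2.8, 4: 3.7, 5: 4.6}
```

With these shapes, `find_inflection(PLATEAU_CURVE)` flags N=4, while `find_inflection(NEAR_LINEAR_CURVE)` returns `None`, which is the pattern the headline findings describe for the tested range.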
Quality (Phase 1 only — codex/provider comparisons didn’t include judging)
Mean judge score (1-5 LLM-judge, code_review task, minimax-m3) across N=1..5 stayed in the 2.0–3.0 range. Some quality degradation is visible at higher N for some tasks (e.g., design_doc N=4 r1 = 2.0). The quality-vs-throughput tradeoff exists but is mild in the tested range.
What’s here
- Methodology — task definitions, prompt templates, judge rubric, dispatcher internals
- Results — per-trial raw + per-concurrency aggregates for all 3 sweeps
- Analysis — the throughput curves, cost scatterplots, hypothesis verdict
- Code — `dispatch_swarm.py` + `analyze_swarm.py` source + reproduction recipe
- Paper — full academic paper with BibTeX citations
Status
All three sweeps are complete: 165 trials shipped, $0 spent. The hypothesis was refuted, and that is the publishable finding. In practice: if you’re already using Skynet’s LiteLLM proxy, dispatch more agents in parallel, at least up to the N=5 tested here; they really do scale.