Subagent Swarm

The question

Every parallel agent dispatch trades wall-clock speed for API cost + coordination overhead. I dispatch Claude subagents all the time — for code review, audit tasks, sub-investigations — but I’ve never actually measured the throughput curve. Does N parallel agents give me N× throughput, or does it plateau? Where’s the inflection point? Does the curve differ by model tier?

Without a real curve, I’d been guessing. This project measures it.

The hypothesis (REFUTED)

Subagent throughput scales sublinearly with concurrency, with the inflection point depending on model tier.
minimax-m3 (cheap): near-linear to N=5, then plateau
Opus 4.5 (mid-tier): peaks at N=3, degrades by N=5 (rate-limit bound)
Fable 5 (craft-tier): peaks at N=2, hard ceiling at N=3 (token-cost bound)
Net result: optimal concurrency is 2–3 agents, regardless of model tier. Above that, marginal output per dollar goes negative.

Verdict: REFUTED. Throughput scales near-linearly up to N=5 on every model tested. The original guess that “peak is at N=2–3 across all tiers” was wrong — Skynet’s LiteLLM proxy pipelines concurrent requests well enough that no inflection point appeared in the tested range. Grok-3-mini actually hit 125% efficiency at N=4 (5.01× speedup).
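One way to make "inflection point" concrete is to look for the first concurrency level whose marginal throughput gain drops below some fraction of single-agent throughput. The sketch below is illustrative only: `find_inflection`, the `min_gain` threshold of 0.5, and both sample curves are assumptions made up for the example, not measurements or code from this project.

```python
from typing import Optional


def find_inflection(throughput: dict[int, float], min_gain: float = 0.5) -> Optional[int]:
    """Return the first N whose marginal gain over the previous level is
    below min_gain * throughput[1], or None if no level falls short."""
    base = throughput[1]
    levels = sorted(throughput)
    for prev, cur in zip(levels, levels[1:]):
        # Marginal throughput added per extra agent between two levels.
        gain = (throughput[cur] - throughput[prev]) / (cur - prev)
        if gain < min_gain * base:
            return cur
    return None


# Made-up curves in arbitrary units (NOT data from the trials).
plateau = {1: 1.0, 2: 1.9, 3: 2.7, 4: 2.9, 5: 2.9}
linear = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0, 5: 5.0}

print(find_inflection(plateau))
print(find_inflection(linear))
```

Under this definition the plateauing curve reports an inflection at the level where gains collapse, while a linear curve reports none, which is the shape the verdict describes for the tested range.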

The setup


Trials	165 total: Phase 1 (75, minimax-m3) + provider comparison (75, 5 models) + codex comparison (15, codex-gpt-5.4)
Concurrency levels	1, 2, 3, 4, 5 parallel agents
Models tested	direct-MiniMax-M3, Skynet minimax-m3, codex-gpt-5.4, xai/grok-3-mini, zai-coding/glm-5.2
Per-trial metrics	Wall-clock, tokens in/out, cost, output quality (LLM-judge for Phase 1 only)
Total cost	$0 — all subscriptions + free tiers
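For readers who want a feel for what a concurrency sweep looks like, here is a minimal sketch using asyncio. It is not the project's dispatch_swarm.py: `fake_agent`, `run_trial`, `sweep`, and the fixed 0.05s latency are hypothetical stand-ins, and a real dispatcher would replace the sleep with an actual model call and record tokens and cost as well.

```python
import asyncio
import time


async def fake_agent(task_id: int, latency_s: float) -> str:
    # Hypothetical stand-in for a real subagent call; it only sleeps.
    await asyncio.sleep(latency_s)
    return f"task-{task_id}-done"


async def run_trial(n: int, latency_s: float = 0.05) -> dict:
    """Dispatch n agents concurrently and time the whole batch."""
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_agent(i, latency_s) for i in range(n)))
    wall_s = time.perf_counter() - start
    return {"n": n, "wall_s": wall_s, "throughput": n / wall_s, "results": results}


def sweep(levels=(1, 2, 3, 4, 5)) -> list[dict]:
    # One trial per concurrency level; a real sweep would repeat each level.
    return [asyncio.run(run_trial(n)) for n in levels]


sweep_rows = sweep()
for row in sweep_rows:
    print(f"N={row['n']}  wall={row['wall_s']:.3f}s  throughput={row['throughput']:.1f}/s")
```

Because the simulated agents only sleep, this toy version scales almost perfectly; the interesting behavior in the real trials comes from the provider, the proxy, and rate limits.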

Cross-model throughput (real data)

[Figure: cross-model throughput]

N=5 throughput summary (mistral-large is excluded from this comparison at the user’s request: it ran on a pay-as-you-go tier whose rate limits ate its high-concurrency trials):

Model	N=1 wall	N=5 wall	N=5 throughput	N=5 speedup	N=5 efficiency
xai/grok-3-mini	27.5s	29.3s	0.170/s	4.7×	93.7%
MiniMax-M3 (direct)	52.0s	61.2s	0.082/s	4.3×	84.9%
minimax-m3 (Skynet)	29.3s	46.3s	0.108/s	3.2×	63.3%
codex-gpt-5.4	41.0s	56.0s	0.089/s	3.7×	73.1%
zai-coding/glm-5.2	69.4s	100.3s	0.050/s	3.5×	69.2%
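The derived columns can be recomputed from the two wall-clock columns. The definitions below are my assumption about how they were derived: throughput = N / wall, speedup = throughput at N divided by throughput at N=1, and efficiency = speedup / N. With the rounded wall-clock values from the table they reproduce the published figures to within rounding (a few tenths of a percentage point on efficiency), which suggests the published numbers come from unrounded per-trial data.

```python
N = 5

# (model, N=1 wall seconds, N=5 wall seconds), copied from the table.
table_rows = [
    ("xai/grok-3-mini", 27.5, 29.3),
    ("MiniMax-M3 (direct)", 52.0, 61.2),
    ("minimax-m3 (Skynet)", 29.3, 46.3),
    ("codex-gpt-5.4", 41.0, 56.0),
    ("zai-coding/glm-5.2", 69.4, 100.3),
]


def scaling_metrics(wall_1: float, wall_n: float, n: int) -> tuple[float, float, float]:
    """Assumed definitions: tasks/sec, speedup vs. N=1, and parallel efficiency."""
    throughput = n / wall_n            # n tasks finish in wall_n seconds
    speedup = throughput * wall_1      # relative to 1 / wall_1 tasks/sec at N=1
    efficiency = speedup / n           # 1.0 means ideal linear scaling
    return throughput, speedup, efficiency


for model, wall_1, wall_n in table_rows:
    throughput, speedup, efficiency = scaling_metrics(wall_1, wall_n, N)
    print(f"{model:22s} {throughput:.3f}/s  {speedup:.2f}x  {efficiency:.1%}")
```

Note that efficiency reduces to wall_1 / wall_n, so it can exceed 100% whenever a batch of N finishes faster than a single agent did.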

Headline findings

The hypothesis (peak at N=2–3) was wrong. All 5 valid models showed near-linear throughput scaling through N=5 (N=5 efficiency from 63.3% to 93.7%), with no clear inflection point.
Grok-3-mini is the throughput leader — 0.170 tasks/sec at N=5, the only model to beat ideal-linear scaling at N=4 (125% efficiency).
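Direct MiniMax is slower than Skynet-routed MiniMax at every concurrency: about 32% longer wall-clock at N=5 (61.2s vs 46.3s) and about 77% longer at N=1 (52.0s vs 29.3s). Same model, different paths; the route to the model matters.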
GLM-5.2 is the slowest model at every N — heavy reasoning-token budget burn (~40% of output tokens). Wall-clock at N=5 hits 100s/trial.
Codex is middle-of-the-pack: competitive with Skynet minimax but consistently slower than grok-3-mini.

Quality (Phase 1 only — codex/provider comparisons didn’t include judging)

Mean judge score (1–5 scale, LLM judge, code_review task, minimax-m3) stayed in the 2.0–3.0 range across N=1..5. Some quality degradation is visible at higher N for some tasks (e.g., design_doc N=4 r1 = 2.0). The quality-vs-throughput tradeoff exists but is mild in the tested range.
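Checking for quality drift comes down to grouping judge scores by concurrency level and comparing the means. A minimal sketch follows; the record layout (`n`, `task`, `score`), the `mean_score_by_n` helper, and every score in `sample_trials` are made up for illustration and are not the project's data or schema.

```python
from collections import defaultdict
from statistics import mean

# Made-up per-trial records (NOT scores from the actual trials).
sample_trials = [
    {"n": 1, "task": "code_review", "score": 3.0},
    {"n": 1, "task": "design_doc", "score": 3.0},
    {"n": 4, "task": "code_review", "score": 3.0},
    {"n": 4, "task": "design_doc", "score": 2.0},
]


def mean_score_by_n(trials: list[dict]) -> dict[int, float]:
    """Average the judge score over all trials at each concurrency level."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for trial in trials:
        buckets[trial["n"]].append(trial["score"])
    return {n: mean(scores) for n, scores in sorted(buckets.items())}


print(mean_score_by_n(sample_trials))
```

Grouping by task as well as by N would show whether a drop comes from one task type or from concurrency itself.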

What’s here

Methodology — task definitions, prompt templates, judge rubric, dispatcher internals
Results — per-trial raw + per-concurrency aggregates for all 3 sweeps
Analysis — the throughput curves, cost scatterplots, hypothesis verdict
Code — dispatch_swarm.py + analyze_swarm.py source + reproduction recipe
Paper — full academic paper with BibTeX citations

Status

Phase 1 complete. All 165 trials ran at a total cost of $0. The hypothesis was refuted, and that is the publishable finding. In practice: if you’re already using Skynet’s LiteLLM proxy, dispatch more agents in parallel; throughput kept scaling through N=5 in these tests.