Analysis

Status: Analysis complete. Original hypothesis (peak at N=2-3) was REFUTED. All 5 valid models showed near-linear throughput scaling through N=5.

The key chart

[Figure: cross-model throughput]

The cross-model throughput chart plots throughput (N/mean_wall) for each model at concurrency levels N=1 to 5. The dashed line marks ideal linear scaling: N times the N=1 baseline throughput, reaching 5× at N=5.
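The plotted quantity and the ideal line can be sketched roughly as follows. This is a minimal illustration, assuming mean_wall is the mean wall-clock time in seconds at concurrency N; the function names and the sample timings are hypothetical and are not measurements from this benchmark.

```python
def throughput(n: int, mean_wall_s: float) -> float:
    """Throughput as plotted: N divided by mean wall-clock seconds."""
    if mean_wall_s <= 0:
        raise ValueError("mean_wall_s must be positive")
    return n / mean_wall_s


def ideal_line(baseline: float, levels=range(1, 6)) -> dict:
    """Ideal linear scaling: N times the N=1 throughput."""
    return {n: n * baseline for n in levels}


# Hypothetical mean wall-clock seconds per concurrency level (illustration only).
mean_wall = {1: 50.0, 2: 55.0, 3: 60.0, 4: 62.0, 5: 70.0}

measured = {n: throughput(n, w) for n, w in mean_wall.items()}
ideal = ideal_line(measured[1])

for n in sorted(measured):
    print(f"N={n}: measured={measured[n]:.4f}/s  ideal={ideal[n]:.4f}/s")
```

A measured point that falls below the ideal value at the same N indicates sublinear scaling at that level; a point above it indicates superlinear scaling.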

Per-model throughput curves

xai/grok-3-mini (Skynet)

[Figure: grok-3-mini throughput]

The throughput leader. N=4 reaches 125% efficiency (5.01× speedup against an ideal of 4×), which means mean wall-clock at N=4 is about 20% shorter than at N=1. Under concurrency, Skynet’s Grok routing performs better than the N=1 measurement alone would suggest.

MiniMax-M3 (direct)

[Figure: direct minimax throughput]

Direct API access. Slow N=1 (52s) but scales well — N=5 wall-clock only grows 18% over N=4. The direct API has a fixed per-request overhead that gets amortized.

minimax-m3 (Skynet)

[Figure: skynet minimax throughput]

Same model, different routing. ~30% faster than direct at every N. The LiteLLM proxy appears to pipeline requests effectively.

codex-gpt-5.4

[Figure: codex throughput]

Middle of the pack. Wall-clock drops to ~52-56s at N=3-5 and goes no lower, which suggests the Codex CLI has a per-trial floor.

zai-coding/glm-5.2

[Figure: glm-5.2 throughput]

Slowest model. GLM-5.2 burns ~40% of its token budget on reasoning, so wall-clock at N=5 hits 100s/trial. Throughput grows sublinearly here because the model itself is the bottleneck, not the dispatcher.

Speedup analysis

Speedup = throughput(N) / throughput(N=1). Ideal linear speedup = N.

Model	N=2 speedup	N=3 speedup	N=4 speedup	N=5 speedup
xai/grok-3-mini	2.39×	2.57×	5.01×	4.69×
MiniMax-M3 (direct)	1.34×	2.14×	3.49×	4.25×
minimax-m3 (Skynet)	1.56×	1.81×	2.77×	3.17×
codex-gpt-5.4	1.69×	2.40×	2.39×	3.66×
zai-coding/glm-5.2	2.15×	2.31×	3.26×	3.46×

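No model shows the predicted inflection point at N=2-3. Every model's speedup at N=5 exceeds its speedup at N=3; the only declines are grok-3-mini between N=4 and N=5 (5.01× to 4.69×) and a marginal dip for codex-gpt-5.4 between N=3 and N=4 (2.40× to 2.39×).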
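The speedup table can be turned into parallel efficiency (speedup divided by N, the same ratio behind the 125% figure quoted for grok-3-mini at N=4) and a best-N per model. The sketch below copies the table values; the names SPEEDUP, efficiency and summarize are illustrative and are not part of the benchmark harness.

```python
# Speedup values copied from the table above (model -> {N: speedup}).
SPEEDUP = {
    "xai/grok-3-mini":     {2: 2.39, 3: 2.57, 4: 5.01, 5: 4.69},
    "MiniMax-M3 (direct)": {2: 1.34, 3: 2.14, 4: 3.49, 5: 4.25},
    "minimax-m3 (Skynet)": {2: 1.56, 3: 1.81, 4: 2.77, 5: 3.17},
    "codex-gpt-5.4":       {2: 1.69, 3: 2.40, 4: 2.39, 5: 3.66},
    "zai-coding/glm-5.2":  {2: 2.15, 3: 2.31, 4: 3.26, 5: 3.46},
}


def efficiency(speedup: float, n: int) -> float:
    """Parallel efficiency: measured speedup over ideal speedup N."""
    return speedup / n


def summarize(table: dict) -> dict:
    """Return model -> (N with highest speedup, {N: efficiency})."""
    out = {}
    for model, row in table.items():
        best_n = max(row, key=row.get)
        out[model] = (best_n, {n: efficiency(s, n) for n, s in row.items()})
    return out


summary = summarize(SPEEDUP)
for model, (best_n, eff) in summary.items():
    eff_text = ", ".join(f"N={n}: {e:.0%}" for n, e in eff.items())
    print(f"{model}: best N={best_n}; efficiency {eff_text}")
```

Reading efficiency alongside raw speedup makes it easier to see where a curve is still rising but at a falling per-agent return.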

Cost analysis

All trials cost $0 because every model tier used was one of the following:

A subscription tier with no per-request charge (minimax-m3, codex-gpt-5.4 via ChatGPT Plus)
Skynet-routed with no usage charge (grok-3-mini, GLM-5.2, mistral-large, though mistral-large failed for unrelated reasons)
Direct MiniMax on a free / promotional tier

For a real cost analysis with paid API access, the same harness could rerun with cost_usd enabled and produce per-trial spend. The data fields (tokens_in, tokens_out, cost_usd) are all populated even when the cost is $0.
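As a rough sketch of what a paid-tier rerun could compute, per-trial spend can be derived from the recorded token counts. The field names tokens_in, tokens_out and cost_usd come from the harness data; the Trial class, the priced function and the per-million-token prices are hypothetical placeholders, not real rates for any provider.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    tokens_in: int
    tokens_out: int
    cost_usd: float = 0.0  # recorded as 0 in this benchmark


def priced(trial: Trial, usd_per_mtok_in: float, usd_per_mtok_out: float) -> float:
    """Estimated spend for one trial given prices per million tokens."""
    return (
        trial.tokens_in * usd_per_mtok_in
        + trial.tokens_out * usd_per_mtok_out
    ) / 1_000_000


# Hypothetical trial and hypothetical prices, for illustration only.
example = Trial(tokens_in=1_000_000, tokens_out=500_000)
estimate = priced(example, usd_per_mtok_in=1.0, usd_per_mtok_out=2.0)
print(f"estimated cost: ${estimate:.2f}")
```

Because the token fields are already populated, an estimate like this can be applied to the existing results without rerunning any trials.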

Quality analysis (Phase 1 only)

Mean judge score (1-5 LLM-judge) across the 5 Phase 1 tasks × N=1..5:

[Figure placeholder: quality vs. concurrency; see /results for the table]

The quality-vs-throughput tradeoff is mild in the tested range. Only code_review shows clear N-related degradation (N=1 = 4.22, N=5 = 2.93, a drop of about 31%). design_doc is consistently low (~2.0) regardless of N, which is likely a function of the model and the task difficulty rather than concurrency.
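One way to make "clear degradation" concrete is to compare the mean judge score at N=1 and N=5 per task and flag relative drops above a cutoff. In the sketch below, the code_review scores are the ones quoted above and design_doc is approximated as 2.0 at both ends; the 20% threshold and all names are hypothetical choices for illustration.

```python
def relative_drop(score_n1: float, score_n5: float) -> float:
    """Fractional decline in mean judge score from N=1 to N=5."""
    if score_n1 <= 0:
        raise ValueError("score_n1 must be positive")
    return (score_n1 - score_n5) / score_n1


# task -> (mean score at N=1, mean score at N=5)
SCORES = {
    "code_review": (4.22, 2.93),  # values quoted in the text
    "design_doc": (2.0, 2.0),     # approximated from "~2.0 regardless of N"
}

THRESHOLD = 0.20  # hypothetical cutoff for "clear" degradation

flagged = [
    task
    for task, (n1, n5) in SCORES.items()
    if relative_drop(n1, n5) > THRESHOLD
]
print(flagged)
```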

Hypothesis verdict

Hypothesis	Verdict
Subagent throughput scales sublinearly with concurrency	REFUTED — all 5 models scale near-linearly through N=5
Inflection point depends on model tier	REFUTED — no inflection point in tested range
Peak at N=2-3 across all tiers	REFUTED — peak is at N=4-5
Optimal concurrency is 2-3 agents	REFUTED — optimal is N=4 or N=5

Practical implication: If you’re using Skynet’s LiteLLM proxy (or any well-implemented OpenAI-compatible gateway), dispatch more agents in parallel — they really do scale. Don’t artificially cap at N=2-3.

Limitations

All trials on one VM — real distributed setup might behave differently
Judge model same tier as trial in Phase 1 — some self-bias possible (mitigated by using a separate LLM call per evaluation)
3 reps per cell is the bare minimum for variance estimation
Tasks are all coding-adjacent — creative writing may have different curves
No error injection — production runs have failure modes this benchmark doesn’t model
Mistral-large excluded due to user-confirmed pay-as-you-go rate-limit tier (would have been 7 more trials in the dataset)
The Codex sweep covered only code_review, so a multi-task Codex sweep would be needed for an apples-to-apples comparison with the other models
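To illustrate why 3 reps per cell is a bare minimum, the sketch below computes a two-sided 95% confidence half-width for a cell mean using Student's t. The critical values are standard table entries; the sample timings and the function name are hypothetical and are not drawn from the benchmark data.

```python
import math
import statistics

# Two-sided 95% Student's t critical values, keyed by degrees of freedom.
T_CRIT_95 = {2: 4.303, 4: 2.776, 9: 2.262}


def ci_half_width(samples: list) -> float:
    """Half-width of the 95% confidence interval for the sample mean."""
    n = len(samples)
    df = n - 1
    if df not in T_CRIT_95:
        raise ValueError(f"no tabulated t value for df={df}")
    return T_CRIT_95[df] * statistics.stdev(samples) / math.sqrt(n)


# Hypothetical wall-clock seconds for one (model, N) cell with 3 reps.
reps = [50.0, 54.0, 58.0]
half = ci_half_width(reps)
print(f"mean={statistics.mean(reps):.1f}s  95% CI half-width={half:.1f}s")
```

With only two degrees of freedom the t multiplier is above 4, so even a modest spread between reps produces a wide interval; moving to 5 or 10 reps shrinks both the multiplier and the standard error.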