Analysis

Status: Analysis complete. Original hypothesis (peak at N=2-3) was REFUTED. All 5 valid models showed near-linear throughput scaling through N=5.

The key chart

cross-model throughput

The cross-model throughput chart plots throughput (N/mean_wall) for each model at concurrency levels N=1 through N=5. The dashed line marks ideal linear scaling from the N=1 baseline, reaching 5× that baseline at N=5.
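As a rough sketch of how the plotted quantity could be computed, assuming mean_wall is the mean per-trial wall-clock time in seconds for one model at one concurrency level (the function names and sample timings here are hypothetical, not taken from the harness):

```python
from statistics import mean


def throughput(n_agents: int, wall_seconds: list[float]) -> float:
    """Throughput for one cell, defined as N / mean_wall."""
    return n_agents / mean(wall_seconds)


def ideal_line(baseline: float, max_n: int = 5) -> list[float]:
    """Ideal linear scaling: N times the N=1 throughput, for N = 1..max_n."""
    return [n * baseline for n in range(1, max_n + 1)]


# Hypothetical wall-clock samples (seconds) from three repetitions at N=1.
base = throughput(1, [50.0, 52.0, 54.0])
dashed = ideal_line(base)
print(dashed)
```

The last entry of `dashed` is five times the N=1 throughput, which is where the dashed line ends on the chart.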

Per-model throughput curves

xai/grok-3-mini (Skynet)

grok-3-mini throughput

The throughput leader. N=4 reaches 125% efficiency (5.01× speedup against an ideal 4×), which means mean wall-clock per trial at N=4 is about 20% shorter than at N=1. Skynet’s Grok routing performs better under concurrency than the N=1 measurement alone would suggest.

MiniMax-M3 (direct)

direct minimax throughput

Direct API access. Slow N=1 (52s) but scales well — N=5 wall-clock only grows 18% over N=4. The direct API has a fixed per-request overhead that gets amortized.

minimax-m3 (Skynet)

skynet minimax throughput

Same model, different routing. ~30% faster than direct at every N. The LiteLLM proxy appears to pipeline requests effectively.

codex-gpt-5.4

codex throughput

Middle of the pack. Wall-clock drops to ~52-56s at N=3-5 and goes no lower; the Codex CLI appears to have a per-trial floor.

zai-coding/glm-5.2

glm-5.2 throughput

Slowest model. GLM-5.2 burns ~40% of its token budget on reasoning, so wall-clock at N=5 hits 100s/trial. Throughput grows sublinearly here because the model itself is the bottleneck, not the dispatcher.

Speedup analysis

Speedup = throughput(N) / throughput(N=1). Ideal linear speedup = N.

| Model | N=2 speedup | N=3 speedup | N=4 speedup | N=5 speedup |
|---|---|---|---|---|
| xai/grok-3-mini | 2.39× | 2.57× | 5.01× | 4.69× |
| MiniMax-M3 (direct) | 1.34× | 2.14× | 3.49× | 4.25× |
| minimax-m3 (Skynet) | 1.56× | 1.81× | 2.77× | 3.17× |
| codex-gpt-5.4 | 1.69× | 2.40× | 2.39× | 3.66× |
| zai-coding/glm-5.2 | 2.15× | 2.31× | 3.26× | 3.46× |

No model shows the predicted inflection point at N=2-3. All curves continue upward through N=5.
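The speedup definition in this section can be sketched in a few lines. Efficiency here is taken to mean speedup divided by N, which matches the 125% figure quoted for grok-3-mini at N=4; the helper names are illustrative only, and `wall_ratio` assumes throughput is N/mean_wall.

```python
def speedup(throughput_n: float, throughput_1: float) -> float:
    """Speedup = throughput(N) / throughput(N=1)."""
    return throughput_n / throughput_1


def efficiency(speedup_n: float, n_agents: int) -> float:
    """Fraction of ideal linear speedup achieved (ideal speedup is N)."""
    return speedup_n / n_agents


def wall_ratio(speedup_n: float, n_agents: int) -> float:
    """Implied mean_wall(N) / mean_wall(1), given throughput = N / mean_wall."""
    return n_agents / speedup_n


# Value from the speedup table: xai/grok-3-mini at N=4 is 5.01x.
eff = efficiency(5.01, 4)
ratio = wall_ratio(5.01, 4)
print(f"efficiency={eff:.0%}, wall ratio={ratio:.2f}")
```

An efficiency above 100% corresponds to a wall ratio below 1, meaning the mean per-trial wall-clock at that concurrency is shorter than at N=1.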

Cost analysis

All trials cost $0 because every model tier used was one of the following:

  • A free subscription tier (minimax-m3, codex-gpt-5.4 via ChatGPT Plus)
  • Skynet-routed with no usage charge (grok-3-mini, GLM-5.2, mistral-large — though mistral-large failed for unrelated reasons)
  • Direct MiniMax on a free / promotional tier

For a real cost analysis with paid API access, the same harness could rerun with cost_usd enabled and produce per-trial spend. The data fields (tokens_in, tokens_out, cost_usd) are all populated even when the cost is $0.
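A minimal sketch of how per-trial spend might be rolled up once costs are non-zero. The field names tokens_in, tokens_out, and cost_usd come from the text above; the `TrialRecord` class, the `model` field, and the sample numbers are assumptions made for illustration and do not reflect the harness's real schema.

```python
from dataclasses import dataclass


@dataclass
class TrialRecord:
    model: str        # hypothetical field identifying the model tier
    tokens_in: int
    tokens_out: int
    cost_usd: float


def spend_by_model(records: list[TrialRecord]) -> dict[str, float]:
    """Sum cost_usd per model across all trials."""
    totals: dict[str, float] = {}
    for rec in records:
        totals[rec.model] = totals.get(rec.model, 0.0) + rec.cost_usd
    return totals


# Made-up records: one paid trial pair and one zero-cost trial.
records = [
    TrialRecord("model-a", 1200, 800, 0.25),
    TrialRecord("model-a", 1100, 900, 0.50),
    TrialRecord("model-b", 1000, 700, 0.0),
]
totals = spend_by_model(records)
print(totals)
```

Zero-cost trials still appear in the totals, so the same aggregation works unchanged on the current all-$0 dataset.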

Quality analysis (Phase 1 only)

Mean judge score (1-5 LLM-judge) across the 5 Phase 1 tasks × N=1..5:

![quality vs concurrency placeholder — see /results for table]

The quality-vs-throughput tradeoff is mild in the tested range. Only code_review shows clear N-related degradation (N=1 = 4.22, N=5 = 2.93). design_doc is consistently low (~2.0) regardless of N — likely a function of the model + task difficulty, not concurrency.

Hypothesis verdict

| Hypothesis | Verdict |
|---|---|
| Subagent throughput scales sublinearly with concurrency | REFUTED: all 5 models scale near-linearly through N=5 |
| Inflection point depends on model tier | REFUTED: no inflection point in tested range |
| Peak at N=2-3 across all tiers | REFUTED: peak is at N=4-5 |
| Optimal concurrency is 2-3 agents | REFUTED: optimal is N=4 or N=5 |

Practical implication: If you’re using Skynet’s LiteLLM proxy (or any well-implemented OpenAI-compatible gateway), dispatch more agents in parallel — they really do scale. Don’t artificially cap at N=2-3.
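One way to act on this is a dispatcher with a configurable concurrency cap rather than a hard-coded low limit. The sketch below uses only the standard library; `run_agent` is a stand-in that sleeps instead of calling a real gateway, and every name in it is hypothetical.

```python
import asyncio


async def run_agent(task_id: int) -> str:
    """Placeholder for a real agent call; sleeps to simulate latency."""
    await asyncio.sleep(0.01)
    return f"task-{task_id}: done"


async def dispatch(task_ids, max_concurrency: int = 5) -> list[str]:
    """Run all tasks with at most max_concurrency in flight at once."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(task_id: int) -> str:
        async with sem:
            return await run_agent(task_id)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(t) for t in task_ids))


results = asyncio.run(dispatch(range(10), max_concurrency=5))
print(results)
```

Keeping `max_concurrency` as a parameter makes it easy to rerun the same workload at N=4 or N=5 and confirm the scaling on a different gateway.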

Limitations

  • All trials on one VM — real distributed setup might behave differently
  • Judge model same tier as trial in Phase 1 — some self-bias possible (mitigated by using a separate LLM call per evaluation)
  • 3 reps per cell is the bare minimum for variance estimation
  • Tasks are all coding-adjacent — creative writing may have different curves
  • No error injection — production runs have failure modes this benchmark doesn’t model
  • Mistral-large excluded due to user-confirmed pay-as-you-go rate-limit tier (would have been 7 more trials in the dataset)
  • The Codex sweep covered only code_review; an apples-to-apples comparison with the other models would need a multi-task Codex sweep
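To illustrate why 3 reps per cell is the bare minimum, here is a sketch of a 95% confidence interval for a cell mean using Student's t with 2 degrees of freedom (critical value about 4.303). The sample timings are invented, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% Student-t critical value for 2 degrees of freedom (n = 3).
T_CRIT_95_DF2 = 4.303


def ci95_three_reps(samples: list[float]) -> tuple[float, float]:
    """95% confidence interval for the mean of exactly three repetitions."""
    if len(samples) != 3:
        raise ValueError("expected exactly 3 repetitions")
    centre = mean(samples)
    half_width = T_CRIT_95_DF2 * stdev(samples) / sqrt(3)
    return centre - half_width, centre + half_width


# Invented wall-clock samples (seconds) for one cell.
low, high = ci95_three_reps([50.0, 52.0, 54.0])
print(f"{low:.2f} .. {high:.2f}")
```

With only three samples the t multiplier is more than twice the familiar 1.96, so even modest run-to-run spread produces a wide interval around each cell mean.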