Results

Results

Status: All three sweeps complete. 165 trials total, 0 catastrophic failures (8 mistral-large trials excluded due to user-confirmed rate-limit tier).

Sweep summary

SweepTrialsFailuresCost
Phase 1 (minimax-m3, 5 tasks × 5 conc × 3 reps)750$0
Provider comparison (5 models × 5 conc × 3 reps)757 (all mistral-large)$0
Codex (gpt-5.4, code_review × 5 conc × 3 reps)150$0

Per-trial raw data lives at results/phase1.jsonl, results/provider-comparison.jsonl, results/codex-comparison.jsonl. All gitignored (regenerable).

Per-concurrency aggregates (cross-model)

ModelSourceN=1 wallN=5 wallN=5 throughputN=5 speedupN=5 efficiency
xai/grok-3-miniSkynet27.5s29.3s0.170/s4.7×93.7%
MiniMax-M3direct52.0s61.2s0.082/s4.3×84.9%
minimax-m3Skynet29.3s46.3s0.108/s3.2×63.3%
codex-gpt-5.4codex CLI41.0s56.0s0.089/s3.7×73.1%
zai-coding/glm-5.2Skynet69.4s100.3s0.050/s3.5×69.2%

Per-task quality (Phase 1 — judge scored 1-5)

Mean judge score by task × N (LLM-judge, correctness + completeness + concision, averaged across 3 reps):

TaskN=1N=2N=3N=4N=5
code_review4.222.943.113.082.93
doc_generation3.563.333.043.393.04
test_generation2.221.892.222.251.98
refactor2.442.782.852.672.69
design_doc1.782.222.152.502.24

Quality observations:

  • code_review peaks at N=1 (4.22) and degrades ~30% by N=5 — the only task with clear N-related degradation
  • design_doc is the worst-scored task across all N — judge may be too strict for 2K-word synthesis, or this is just hard for minimax-m3
  • test_generation is consistently low (~2.0) regardless of N — flat-line quality
  • refactor and doc_generation are flat across N

Variance estimates (3 reps per cell)

Standard deviation across reps, per (model, N) cell, for wall_clock_s:

σ(N=1)σ(N=2)σ(N=3)σ(N=4)σ(N=5)
direct-minimax12.513.64.66.814.5
minimax-m3 (skynet)7.25.47.31.59.4
codex-gpt-5.44.98.43.827.52.2
xai/grok-3-mini5.83.02.94.17.7
zai-coding/glm-5.27.67.07.49.812.4

Most cells σ < 15s — improvements smaller than that are within noise. High-σ cells (codex N=4 = 27.5s, direct-minimax N=5 = 14.5s) reflect genuine single-trial outliers.

Cross-model comparison (mistral-large excluded)

ModelBest NPeak speedupCost (5 trials N=5)Notes
xai/grok-3-miniN=45.01×$0Beat ideal-linear at N=4 (125% efficiency)
MiniMax-M3 (direct)N=43.49×$0Slow N=1, scales well
codex-gpt-5.4N=53.66×$0Middle of pack
zai-coding/glm-5.2N=43.26×$0Slow due to reasoning-token budget
minimax-m3 (Skynet)N=53.17×$0Fastest N=1 at 29.3s

Hypothesis scorecard

HypothesisPredictedObservedVerdict
minimax-m3 near-linear to N=5✓ linear to N=5CONFIRMED
Peak at N=2-3 across all tiersAll models continue scaling through N=5REFUTED
Optimal concurrency is 2-3Optimal is N=5 (highest throughput)REFUTED
Sublinear scalingLinear-to-superlinear (grok N=4 = 125% efficiency)REFUTED

Raw data

Per-trial JSONL is downloadable via the analysis script:

python3 scripts/analyze_swarm.py --input results/phase1.jsonl --output-dir static/img/charts/