Methodology

The 5 task templates

All tasks take a fixed input fixture (a small real repo, function, or doc the agent has never seen) and produce output that is judged against a fixed rubric. This makes the quality score comparable across model tiers, not just within them.

#	Task	Input size	Output size	Why
1	Code review	2K LOC	1K review	Tests structured-analysis ability; classic use case
2	Doc generation	500 LOC	1.5K doc	Tests structured prose generation
3	Unit tests	800 LOC	1.2K tests	Tests code comprehension + test pattern knowledge
4	Refactor proposal	1K LOC	800 design	Tests architectural reasoning
5	Design doc	200 LOC	2K doc	Tests long-form synthesis from small inputs

Concurrency matrix

Each (task, concurrency) cell is run 3 times for variance estimation.

	N=1	N=2	N=3	N=4	N=5
code-review	✓	✓	✓	✓	✓
doc-gen	✓	✓	✓	✓	✓
test-gen	✓	✓	✓	✓	✓
refactor	✓	✓	✓	✓	✓
design-doc	✓	✓	✓	✓	✓

5 tasks × 5 concurrency levels × 3 repetitions = 75 trials per model tier
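The trial count follows directly from the matrix: every task is paired with every concurrency level, and each cell is repeated three times. A minimal sketch of enumerating that plan is below; the names `TASKS`, `CONCURRENCY_LEVELS`, `REPS`, and `build_trial_plan` are illustrative and not taken from the harness.

```python
from itertools import product

# Labels copied from the concurrency matrix; the constant names are hypothetical.
TASKS = ["code-review", "doc-gen", "test-gen", "refactor", "design-doc"]
CONCURRENCY_LEVELS = [1, 2, 3, 4, 5]
REPS = [1, 2, 3]


def build_trial_plan():
    """Return one dict per (task, concurrency, rep) combination."""
    return [
        {"task": task, "concurrency": n, "rep": rep}
        for task, n, rep in product(TASKS, CONCURRENCY_LEVELS, REPS)
    ]


if __name__ == "__main__":
    plan = build_trial_plan()
    print(len(plan))  # 5 * 5 * 3 = 75
```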

Phase plan

Phase	Model tiers	Budget	Status
1	minimax-m3	$0	in progress
2	Opus 4.5	~$40	pending Phase 1 results
2	Grok (imagine text)	~$5	pending Phase 1 results
2	GLM-5.2	~$30	pending Phase 1 results

If Phase 1 reveals that the harness is broken (the judge is noisy, fixtures leak, or concurrency is not actually parallel), Phase 2 is paused until the bug is fixed.

Per-trial measurements

Each trial records:

model — model identifier (e.g. minimax-m3, claude-opus-4-5)
task_id — 1..5
concurrency — 1..5
rep — 1..3 (variance estimation)
trial_id — uuid
started_at, finished_at — ISO timestamps
wall_clock_seconds — dispatch start → all-N done
tokens_in, tokens_out — totals across N agents
cost_usd — total cost
per_agent_outputs — list of N raw outputs
per_agent_metrics — list of N (tokens_in/out, cost, latency)
judge_scores — list of N LLM-judge scores (1–5)
mechanical_checks — completion, length, coherence
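One way to hold these fields is a small dataclass, sketched below. The field names come from the list above, but the Python types, the `TrialRecord` class name, and the validation rules are assumptions made for illustration; the harness may store records differently.

```python
from dataclasses import dataclass, field, asdict
from uuid import uuid4


@dataclass
class TrialRecord:
    """Hypothetical container for one trial; types are assumed, not specified."""
    model: str
    task_id: int                 # 1..5
    concurrency: int             # 1..5
    rep: int                     # 1..3
    started_at: str              # ISO timestamp
    finished_at: str             # ISO timestamp
    wall_clock_seconds: float
    tokens_in: int               # total across N agents
    tokens_out: int              # total across N agents
    cost_usd: float
    per_agent_outputs: list
    per_agent_metrics: list
    judge_scores: list
    mechanical_checks: dict
    trial_id: str = field(default_factory=lambda: str(uuid4()))

    def __post_init__(self):
        # Range checks mirror the 1..5 / 1..5 / 1..3 bounds listed above.
        for name, upper in (("task_id", 5), ("concurrency", 5), ("rep", 3)):
            value = getattr(self, name)
            if not 1 <= value <= upper:
                raise ValueError(f"{name}={value} outside 1..{upper}")
        # Each per-agent list should have exactly N entries.
        for name in ("per_agent_outputs", "per_agent_metrics", "judge_scores"):
            if len(getattr(self, name)) != self.concurrency:
                raise ValueError(f"{name} must have {self.concurrency} entries")
```

Because every field is a plain value, `asdict(record)` yields a dict that can be written out with `json.dumps` if the records are stored as JSON lines.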

Judge rubric

Quality is scored 1–5 by a separate LLM-judge call (same model tier as the trial) on three dimensions:

Correctness — does the output achieve the stated task?
Completeness — does it cover what was asked?
Concision — is it free of padding / filler / repetition?

Final score = mean of the three dimension scores.
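As a worked example of the scoring rule, the sketch below averages the three dimension scores for one agent output. The function name `final_score` and the dict-based input shape are assumptions; only the three dimensions and the 1 to 5 range come from the rubric.

```python
DIMENSIONS = ("correctness", "completeness", "concision")


def final_score(scores):
    """Average the three rubric dimensions; each must be within 1..5."""
    values = []
    for dim in DIMENSIONS:
        if dim not in scores:
            raise ValueError(f"missing dimension: {dim}")
        value = scores[dim]
        if not 1 <= value <= 5:
            raise ValueError(f"{dim}={value} outside 1..5")
        values.append(value)
    return sum(values) / len(values)


if __name__ == "__main__":
    # (4 + 5 + 3) / 3 = 4.0
    print(final_score({"correctness": 4, "completeness": 5, "concision": 3}))
```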

Concurrency mechanics

The dispatcher uses Python's concurrent.futures.ThreadPoolExecutor with max_workers set to the trial's concurrency N. Each worker is an independent claude -p subprocess (for Claude models) or HTTP request (for Skynet/OpenAI-compatible providers). Wall-clock time is measured from T0 (first worker spawned) to T_done (last worker finishes).
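The dispatch-and-time pattern described above can be sketched as follows. This is an illustration, not the harness code: `dispatch` and the `run_agent` callable are hypothetical names, and `run_agent` stands in for whatever actually launches the subprocess or HTTP request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def dispatch(run_agent, prompt, concurrency):
    """Run `concurrency` agents in parallel and return (outputs, wall_clock_seconds).

    `run_agent(prompt, index)` is a hypothetical callable supplied by the caller.
    """
    t0 = time.monotonic()  # T0: taken just before the first worker is submitted
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(run_agent, prompt, i) for i in range(concurrency)]
        # result() blocks, so this list is complete only when the last worker is done
        outputs = [f.result() for f in futures]
    wall_clock_seconds = time.monotonic() - t0  # T_done - T0
    return outputs, wall_clock_seconds


if __name__ == "__main__":
    def fake_agent(prompt, index):
        time.sleep(0.2)  # stands in for a model call
        return f"{index}:{prompt}"

    outs, wall = dispatch(fake_agent, "review this", 3)
    print(len(outs), round(wall, 1))
```

With a stand-in agent that sleeps for a fixed time, the wall-clock result should stay close to one sleep interval rather than N of them, which gives a cheap check that the workers really run in parallel.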

Token window guardrails:

Claude Code/Opus via OAuth Max: 5-hour rolling window — track usage, hard-stop at 60% of budget per hour
Grok SuperGrok Heavy: ~500 images/day quota, treat as token-equivalent ~150K/day
GLM-5.2: 1305 throttle on big system prompts — force compact prompt override via ~/.hermes/prompts/glm-5.2-compact.txt
minimax-m3: effectively free, no constraint
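A hard-stop guardrail of the kind listed for the Claude window could look like the sketch below. How the budget is defined per window is not specified here, so the function takes usage and budget as plain numbers; the name `should_hard_stop` and the default threshold parameter are assumptions, with 0.60 taken from the 60% figure above.

```python
def should_hard_stop(tokens_used, token_budget, threshold=0.60):
    """Return True once usage reaches the given fraction of the budget.

    Both arguments are in the same unit (for example tokens); how the budget
    is derived for a given provider is left to the caller.
    """
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    return tokens_used / token_budget >= threshold


if __name__ == "__main__":
    print(should_hard_stop(tokens_used=90_000, token_budget=100_000))  # True
    print(should_hard_stop(tokens_used=10_000, token_budget=100_000))  # False
```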

The fixtures

Input fixtures are pinned at fixed git SHAs to prevent drift across trials. See tasks/fixtures/ for the per-task source code and a README explaining why each fixture was chosen.
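A pre-flight check can confirm that a fixture checkout still matches its pinned SHA before a trial starts. The sketch below assumes each fixture is a git checkout and that the pinned SHA is available as a string; the function names and the way the pin is stored are hypothetical.

```python
import subprocess


def current_sha(repo_path):
    """Return the commit SHA currently checked out at `repo_path`."""
    result = subprocess.run(
        ["git", "-C", repo_path, "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def assert_pinned(actual_sha, pinned_sha):
    """Raise if the checked-out SHA differs from the pinned one."""
    if actual_sha.lower() != pinned_sha.lower():
        raise RuntimeError(
            f"fixture drifted: expected {pinned_sha}, found {actual_sha}"
        )
```

A caller would run `assert_pinned(current_sha(path), pinned)` once per fixture and abort the run on failure, so that no trial is recorded against a drifted input.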