Methodology
The 5 task templates
All tasks take a fixed input fixture (a real small repo / function / doc the agent has never seen) and produce output judged against a fixed rubric. This way the quality score is comparable across model tiers, not just within them.
| # | Task | Input size | Output size | Why |
|---|---|---|---|---|
| 1 | Code review | 2K LOC | 1K review | Tests structured-analysis ability; classic use case |
| 2 | Doc generation | 500 LOC | 1.5K doc | Tests structured prose generation |
| 3 | Unit tests | 800 LOC | 1.2K tests | Tests code comprehension + test pattern knowledge |
| 4 | Refactor proposal | 1K LOC | 800 design | Tests architectural reasoning |
| 5 | Design doc | 200 LOC | 2K doc | Tests long-form synthesis from small inputs |
Concurrency matrix
Each (task, concurrency) cell is run 3 times for variance estimation.
| Task | N=1 | N=2 | N=3 | N=4 | N=5 |
|---|---|---|---|---|---|
| code-review | ✓ | ✓ | ✓ | ✓ | ✓ |
| doc-gen | ✓ | ✓ | ✓ | ✓ | ✓ |
| test-gen | ✓ | ✓ | ✓ | ✓ | ✓ |
| refactor | ✓ | ✓ | ✓ | ✓ | ✓ |
| design-doc | ✓ | ✓ | ✓ | ✓ | ✓ |
5 tasks × 5 concurrency levels × 3 repetitions = 75 trials per model tier.
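The trial count follows directly from the matrix. A minimal sketch of enumerating it, assuming the task names from the table and a hypothetical `build_trial_matrix` helper (not part of the harness as described):

```python
from itertools import product

# Task names taken from the concurrency matrix; the list names are illustrative.
TASKS = ["code-review", "doc-gen", "test-gen", "refactor", "design-doc"]
CONCURRENCY_LEVELS = [1, 2, 3, 4, 5]
REPS = [1, 2, 3]


def build_trial_matrix():
    """Return every (task, concurrency, rep) combination for one model tier."""
    return list(product(TASKS, CONCURRENCY_LEVELS, REPS))


trials = build_trial_matrix()
print(len(trials))  # 5 * 5 * 3 = 75
```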
Phase plan
| Phase | Model tiers | Budget | Status |
|---|---|---|---|
| 1 | minimax-m3 | $0 | in progress |
| 2 | Opus 4.5 | ~$40 | pending Phase 1 results |
| 2 | Grok (imagine text) | ~$5 | pending Phase 1 results |
| 2 | GLM-5.2 | ~$30 | pending Phase 1 results |
If Phase 1 reveals the harness is broken (judge is noisy, fixtures leak, concurrency not actually parallel), Phase 2 is paused until the bug is fixed.
Per-trial measurements
Each trial records:
- `model`: model identifier (e.g. `minimax-m3`, `claude-opus-4-5`)
- `task_id`: 1..5
- `concurrency`: 1..5
- `rep`: 1..3 (variance estimation)
- `trial_id`: uuid
- `started_at`, `finished_at`: ISO timestamps
- `wall_clock_seconds`: dispatch start → all-N done
- `tokens_in`, `tokens_out`: totals across N agents
- `cost_usd`: total cost
- `per_agent_outputs`: list of N raw outputs
- `per_agent_metrics`: list of N (tokens_in/out, cost, latency)
- `judge_scores`: list of N LLM-judge scores (1–5)
- `mechanical_checks`: completion, length, coherence
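One way to hold these fields is a small dataclass. This is a sketch only: the field names come from the list above, but the concrete types (for example, `mechanical_checks` as a dict of booleans and `per_agent_metrics` as a list of dicts) and the length check are assumptions, not the harness's actual schema.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class TrialRecord:
    model: str
    task_id: int                  # 1..5
    concurrency: int              # 1..5
    rep: int                      # 1..3
    started_at: str               # ISO timestamp
    finished_at: str              # ISO timestamp
    wall_clock_seconds: float     # dispatch start -> all-N done
    tokens_in: int                # total across N agents
    tokens_out: int               # total across N agents
    cost_usd: float
    per_agent_outputs: List[str]
    per_agent_metrics: List[Dict[str, Any]]   # assumed shape
    judge_scores: List[float]                 # one score (1-5) per agent
    mechanical_checks: Dict[str, bool]        # assumed shape
    trial_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def __post_init__(self):
        # Every per-agent list should have exactly N = concurrency entries.
        for name in ("per_agent_outputs", "per_agent_metrics", "judge_scores"):
            if len(getattr(self, name)) != self.concurrency:
                raise ValueError(f"{name} must have {self.concurrency} entries")
```

Validating the per-agent list lengths at construction time catches a dispatcher that silently dropped a worker before the record reaches analysis.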
Judge rubric
Quality is scored 1–5 by a separate LLM-judge call (same model tier as the trial) on three dimensions:
- Correctness — does the output achieve the stated task?
- Completeness — does it cover what was asked?
- Concision — is it free of padding / filler / repetition?
Final score = mean of the three dimension scores.
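The aggregation can be sketched as follows. The dimension keys and the `final_score` name are illustrative assumptions; only the 1–5 scale and the unweighted mean come from the rubric.

```python
DIMENSIONS = ("correctness", "completeness", "concision")


def final_score(scores):
    """Return the unweighted mean of the three rubric dimensions."""
    values = []
    for dim in DIMENSIONS:
        value = scores[dim]
        if not 1 <= value <= 5:
            raise ValueError(f"{dim} score {value} is outside the 1-5 scale")
        values.append(value)
    return sum(values) / len(values)


print(final_score({"correctness": 4, "completeness": 5, "concision": 3}))  # 4.0
```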
Concurrency mechanics
The dispatcher uses Python's `concurrent.futures.ThreadPoolExecutor` with N = `concurrency` workers. Each worker is an independent `claude -p` subprocess (for Claude models) or HTTP request (for Skynet/OpenAI-compatible providers). Wall-clock is measured from T0 (first worker spawned) to T_done (last worker finishes).
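A minimal sketch of that dispatch pattern is shown here. `run_agent` is a hypothetical stand-in that only sleeps; in the harness it would launch the subprocess or HTTP request, whose exact invocation is not reproduced here.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_agent(agent_index):
    """Hypothetical stand-in for one worker (subprocess or HTTP call)."""
    time.sleep(0.2)  # simulate model latency
    return f"output from agent {agent_index}"


def dispatch(concurrency, worker=run_agent):
    """Run `concurrency` workers in parallel; return outputs and wall-clock."""
    t0 = time.monotonic()  # T0: just before the first worker is submitted
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(worker, i) for i in range(concurrency)]
        outputs = [f.result() for f in futures]  # blocks until the last one
    wall_clock_seconds = time.monotonic() - t0  # T_done - T0
    return outputs, wall_clock_seconds


outputs, elapsed = dispatch(3)
print(len(outputs), round(elapsed, 2))
```

If the workers really run in parallel, the elapsed time for N workers should sit close to the latency of a single worker rather than N times it, which gives a cheap check for the "concurrency not actually parallel" failure mode.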
Token window guardrails:
- Claude Code/Opus via OAuth Max: 5-hour rolling window — track usage, hard-stop at 60% of budget per hour
- Grok SuperGrok Heavy: ~500 images/day quota, treat as token-equivalent ~150K/day
- GLM-5.2: 1305 throttle on big system prompts; force compact prompt override via `~/.hermes/prompts/glm-5.2-compact.txt`
- minimax-m3: effectively free, no constraint
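A rolling-window guardrail of the kind listed above could look roughly like this. It is a sketch under stated assumptions: the `RollingBudget` class, the token budget value, and the reading of the rule as "halt once usage inside the window reaches 60% of a configured budget" are all illustrative, not the harness's actual accounting.

```python
from collections import deque


class RollingBudget:
    """Track token usage over a rolling window and signal a hard stop."""

    def __init__(self, budget_tokens, window_seconds=5 * 3600, stop_fraction=0.6):
        self.budget_tokens = budget_tokens    # hypothetical budget size
        self.window_seconds = window_seconds  # 5-hour rolling window
        self.stop_fraction = stop_fraction    # hard-stop threshold
        self._events = deque()                # (timestamp, tokens) pairs

    def record(self, tokens, now):
        self._events.append((now, tokens))

    def used(self, now):
        # Drop events that have aged out of the window, then sum the rest.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()
        return sum(tokens for _, tokens in self._events)

    def should_stop(self, now):
        return self.used(now) >= self.stop_fraction * self.budget_tokens
```

The dispatcher would call `should_stop` before launching each trial and pause the run when it returns `True`.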
The fixtures
Input fixtures are pinned at fixed git SHAs to prevent drift across trials. See tasks/fixtures/ for the per-task source code + README explaining why it was chosen.