Methodology

The 5 task templates

Every task takes a fixed input fixture (a small real repo, function, or doc the agent has never seen) and produces output that is judged against a fixed rubric. Because the fixture and the rubric are held constant, the quality score is comparable across model tiers, not just within them.

#  Task               Input size  Output size  Why
1  Code review        2K LOC      1K review    Tests structured-analysis ability; classic use case
2  Doc generation     500 LOC     1.5K doc     Tests structured prose generation
3  Unit tests         800 LOC     1.2K tests   Tests code comprehension + test pattern knowledge
4  Refactor proposal  1K LOC      800 design   Tests architectural reasoning
5  Design doc         200 LOC     2K doc       Tests long-form synthesis from small inputs

Concurrency matrix

Each (task, concurrency) cell is run 3 times for variance estimation.

Task         N=1  N=2  N=3  N=4  N=5
code-review  ×3   ×3   ×3   ×3   ×3
doc-gen      ×3   ×3   ×3   ×3   ×3
test-gen     ×3   ×3   ×3   ×3   ×3
refactor     ×3   ×3   ×3   ×3   ×3
design-doc   ×3   ×3   ×3   ×3   ×3

5 tasks × 5 concurrency levels × 3 repetitions = 75 trials per model tier
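
The trial count can be checked by enumerating the matrix. The following is a minimal sketch; the task names are taken from the matrix rows, while `build_trial_plan` and the constant names are hypothetical and not part of the harness.

```python
from itertools import product

# Task names as they appear in the concurrency matrix.
TASKS = ["code-review", "doc-gen", "test-gen", "refactor", "design-doc"]
CONCURRENCY_LEVELS = [1, 2, 3, 4, 5]   # N=1 .. N=5
REPS = [1, 2, 3]                       # 3 runs per cell for variance estimation


def build_trial_plan():
    """Return one (task, concurrency, rep) tuple per trial for a single model tier."""
    return list(product(TASKS, CONCURRENCY_LEVELS, REPS))


plan = build_trial_plan()
print(len(plan))  # 75
```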

Phase plan

Phase  Model tier           Budget  Status
1      minimax-m3           $0      in progress
2      Opus 4.5             ~$40    pending Phase 1 results
2      Grok (imagine text)  ~$5     pending Phase 1 results
2      GLM-5.2              ~$30    pending Phase 1 results

If Phase 1 reveals the harness is broken (judge is noisy, fixtures leak, concurrency not actually parallel), Phase 2 is paused until the bug is fixed.

Per-trial measurements

Each trial records:

  • model — model identifier (e.g. minimax-m3, claude-opus-4-5)
  • task_id — 1..5
  • concurrency — 1..5
  • rep — 1..3 (variance estimation)
  • trial_id — uuid
  • started_at, finished_at — ISO timestamps
  • wall_clock_seconds — dispatch start → all-N done
  • tokens_in, tokens_out — totals across N agents
  • cost_usd — total cost
  • per_agent_outputs — list of N raw outputs
  • per_agent_metrics — list of N (tokens_in/out, cost, latency)
  • judge_scores — list of N LLM-judge scores (1–5)
  • mechanical_checks — completion, length, coherence
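
One way to hold these fields is a small dataclass. This is only a sketch: the field names follow the list above, but the Python types, the `AgentMetrics` helper, the `validate` method, and the sample values are assumptions made for illustration, not the harness's actual schema.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AgentMetrics:
    """Per-agent numbers (hypothetical helper type)."""
    tokens_in: int
    tokens_out: int
    cost_usd: float
    latency_seconds: float


@dataclass
class TrialRecord:
    model: str
    task_id: int                      # 1..5
    concurrency: int                  # 1..5
    rep: int                          # 1..3
    started_at: str                   # ISO timestamp
    finished_at: str                  # ISO timestamp
    wall_clock_seconds: float         # dispatch start to all-N done
    tokens_in: int                    # total across N agents
    tokens_out: int                   # total across N agents
    cost_usd: float
    per_agent_outputs: List[str]
    per_agent_metrics: List[AgentMetrics]
    judge_scores: List[float]         # one 1-5 score per agent
    mechanical_checks: Dict[str, bool]
    trial_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def validate(self) -> None:
        """Raise ValueError if ranges or per-agent list lengths are inconsistent."""
        if not 1 <= self.task_id <= 5:
            raise ValueError("task_id must be in 1..5")
        if not 1 <= self.concurrency <= 5:
            raise ValueError("concurrency must be in 1..5")
        if not 1 <= self.rep <= 3:
            raise ValueError("rep must be in 1..3")
        per_agent = {
            "per_agent_outputs": self.per_agent_outputs,
            "per_agent_metrics": self.per_agent_metrics,
            "judge_scores": self.judge_scores,
        }
        for name, items in per_agent.items():
            if len(items) != self.concurrency:
                raise ValueError(f"{name} must have one entry per agent")


# Made-up sample values, for illustration only.
record = TrialRecord(
    model="minimax-m3", task_id=1, concurrency=2, rep=1,
    started_at="2025-01-01T00:00:00Z", finished_at="2025-01-01T00:01:00Z",
    wall_clock_seconds=60.0, tokens_in=4000, tokens_out=2000, cost_usd=0.0,
    per_agent_outputs=["review A", "review B"],
    per_agent_metrics=[AgentMetrics(2000, 1000, 0.0, 58.0),
                       AgentMetrics(2000, 1000, 0.0, 60.0)],
    judge_scores=[4.0, 3.7],
    mechanical_checks={"completion": True, "length": True, "coherence": True},
)
record.validate()
```

Checking that every per-agent list has exactly `concurrency` entries catches trials where a worker failed silently.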

Judge rubric

Quality is scored 1–5 by a separate LLM-judge call (same model tier as the trial) on three dimensions:

  • Correctness — does the output achieve the stated task?
  • Completeness — does it cover what was asked?
  • Concision — is it free of padding / filler / repetition?

Final score = mean of the three dimension scores.
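
As a worked example of the averaging, the sketch below computes a final score from three dimension scores. The function name `final_score` and the dictionary input shape are assumptions for illustration; only the three dimensions, the 1-5 range, and the unweighted mean come from the rubric.

```python
DIMENSIONS = ("correctness", "completeness", "concision")


def final_score(scores):
    """Unweighted mean of the three rubric dimensions, each scored 1-5."""
    for name in DIMENSIONS:
        value = scores[name]  # KeyError if a dimension is missing
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be in 1..5, got {value}")
    return sum(scores[name] for name in DIMENSIONS) / len(DIMENSIONS)


print(final_score({"correctness": 4, "completeness": 5, "concision": 3}))  # 4.0
```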

Concurrency mechanics

The dispatcher uses Python concurrent.futures.ThreadPoolExecutor with N=concurrency workers. Each worker is an independent claude -p subprocess (for Claude models) or HTTP request (for Skynet/OpenAI-compatible providers). Wall-clock is measured from T0 (first worker spawned) to T_done (last worker finishes).
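
The dispatch-and-time pattern described above might look roughly like the sketch below. It is not the harness's dispatcher: `dispatch` and `fake_agent` are hypothetical names, and `fake_agent` just sleeps in place of the real subprocess or HTTP call. The timer starts immediately before the first worker is submitted, which approximates T0.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def dispatch(run_agent, prompt, concurrency):
    """Run `concurrency` independent calls of run_agent(prompt) in parallel threads.

    Returns (outputs, wall_clock_seconds), where wall-clock spans from just
    before the first submit until the last worker has finished.
    """
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(run_agent, prompt) for _ in range(concurrency)]
        # result() blocks, so the list is complete only when all N are done.
        outputs = [f.result() for f in futures]
    wall_clock_seconds = time.perf_counter() - t0
    return outputs, wall_clock_seconds


def fake_agent(prompt):
    """Stand-in worker (assumption): sleeps instead of calling a model."""
    time.sleep(0.2)
    return f"echo: {prompt}"


outputs, seconds = dispatch(fake_agent, "review this repo", 3)
# If the workers really overlap, three 0.2 s calls finish in well under 0.6 s.
print(len(outputs), round(seconds, 2))
```

Comparing the measured wall-clock against the serial bound (N times the single-agent latency) is a cheap way to detect the "concurrency not actually parallel" failure mode mentioned in the phase plan.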

Token window guardrails:

  • Claude Code/Opus via OAuth Max: 5-hour rolling window — track usage, hard-stop at 60% of budget per hour
  • Grok SuperGrok Heavy: ~500 images/day quota, treat as token-equivalent ~150K/day
  • GLM-5.2: 1305 throttle on big system prompts — force compact prompt override via ~/.hermes/prompts/glm-5.2-compact.txt
  • minimax-m3: effectively free, no constraint

The fixtures

Input fixtures are pinned at fixed git SHAs to prevent drift across trials. See tasks/fixtures/ for the per-task source code + README explaining why it was chosen.