TL;DR
We're publishing our test protocol for Claude Opus 4.8 before any numbers (the tasks, the scoring rubric and the method) so the results are credible when they land. No fabricated scores: the results section stays empty until we've actually run the protocol on real code.
Most "I tested it!" posts lead with a big number and hide the method. We're doing the opposite: here's exactly how we'll evaluate Claude Opus 4.8 on real coding tasks. Results come after the run — not before.
Why publish the protocol first
A benchmark is only as trustworthy as its method. Fixing the task set and scoring rubric before running anything removes the temptation to cherry-pick favorable examples, and lets you reproduce or challenge our results later.
The task set
Real engineering tasks, not puzzles. Each category targets a capability Opus 4.8 is supposed to have improved.
| Category | Example task | What it stresses |
|---|---|---|
| Multi-file bug fix | Fix a failing test spanning 3+ files | Context + minimal diffs |
| Refactor | Restructure a module, preserve behavior | Discipline, no regressions |
| Feature add | Implement a small spec with tests | Correctness + coverage |
| Migration | Move call sites to a new API | Consistency at scale |
| Agentic | Multi-step task with tools | Reliability over a loop |
The scoring rubric
Every task is scored on the same axes, weighted toward what actually matters in production: does it work, and is it safe to ship?
| Criterion | Weight | Pass bar |
|---|---|---|
| Correctness | 40% | Tests pass; does what was asked |
| Safety / no regressions | 25% | No broken behavior or new bugs |
| Diff quality | 20% | Minimal, idiomatic, reviewable |
| Honesty | 15% | Flags uncertainty / assumptions |
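As a minimal sketch of how the weights above could combine into a single task score: the weights come from the table, but the 0.0 to 1.0 scale per criterion, the dictionary keys and the function name `weighted_score` are assumptions made for illustration, not part of the published protocol.

```python
# Weights copied from the rubric table above.
# Assumption: each criterion is scored on a 0.0-1.0 scale (the protocol does not specify one).
RUBRIC_WEIGHTS = {
    "correctness": 0.40,
    "safety": 0.25,
    "diff_quality": 0.20,
    "honesty": 0.15,
}


def weighted_score(criterion_scores: dict[str, float]) -> float:
    """Combine per-criterion scores (each in [0, 1]) into one task score in [0, 1]."""
    missing = RUBRIC_WEIGHTS.keys() - criterion_scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in RUBRIC_WEIGHTS:
        if not 0.0 <= criterion_scores[name] <= 1.0:
            raise ValueError(f"{name} score must be between 0 and 1")
    # Weighted sum; the weights add up to 1.0, so no extra normalization is needed.
    return sum(RUBRIC_WEIGHTS[name] * criterion_scores[name] for name in RUBRIC_WEIGHTS)
```

Keeping the weights in one frozen table makes it easy to confirm later that the scoring used for the results matches what was published here.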
The method
1. Freeze tasks & rubric (pre-registration). Lock the task set and scoring weights (this page) before running anything.
2. Run repeated trials. Run each task multiple times to capture variance, not a single lucky (or unlucky) attempt.
3. Score blind. Strip model identity from outputs before scoring against the rubric to avoid bias.
4. Aggregate & publish raw. Report per-task scores and share raw prompts/outputs, not just a single headline number.
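The blind-scoring step can be sketched roughly as follows. The record layout (`model`, `task`, `output` keys) and the function name `blind_trials` are hypothetical; the idea is only that graders see an opaque ID while a separate key allows un-blinding after scoring. This removes the label field only, so output text that names its own model would still need manual redaction.

```python
import random


def blind_trials(trials, seed=None):
    """Replace the model label on each trial with an opaque ID.

    trials: list of dicts with (assumed) keys "model", "task", "output".
    Returns (blinded, unblinding_key): blinded records carry no model name,
    and unblinding_key maps each opaque ID back to its model.
    """
    rng = random.Random(seed)
    blinded = []
    unblinding_key = {}
    for trial in trials:
        blind_id = f"{rng.getrandbits(64):016x}"
        while blind_id in unblinding_key:  # regenerate on the (unlikely) collision
            blind_id = f"{rng.getrandbits(64):016x}"
        unblinding_key[blind_id] = trial["model"]
        blinded.append({"id": blind_id, "task": trial["task"], "output": trial["output"]})
    # Shuffle so that ordering does not hint at which model produced what.
    rng.shuffle(blinded)
    return blinded, unblinding_key
```

Graders would receive only `blinded`; the key stays with whoever aggregates the scores and is applied once scoring is finished.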
Results
Status: pending
How to read the results (when they land)
- Look at variance, not just the average. A high mean with wild swings is less useful than a steady, slightly-lower one.
- Weight by your own workload. If you never do migrations, that category shouldn't sway your decision.
- Check the raw outputs. The diffs tell you more than the score.
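To make the first two points concrete, here is a small sketch using Python's standard `statistics` module. The function names, the category keys and the idea of expressing your workload as per-category weights are assumptions for illustration; any numbers you feed in are your own, not results from this protocol.

```python
from statistics import mean, stdev


def summarize(trial_scores):
    """Map {category: [score per trial]} to {category: (mean, sample std dev)}."""
    return {
        category: (mean(scores), stdev(scores) if len(scores) > 1 else 0.0)
        for category, scores in trial_scores.items()
    }


def workload_weighted_mean(category_means, workload):
    """Average category means using your own workload mix as weights.

    workload: {category: relative share of your work}; a category you never
    do gets weight 0 and so cannot sway the result.
    """
    total = sum(workload.values())
    if total <= 0:
        raise ValueError("workload weights must sum to a positive number")
    return sum(category_means[c] * w for c, w in workload.items()) / total


# Made-up placeholder inputs, NOT results: one steady series, one that swings.
placeholder = {"steady": [0.7, 0.7, 0.7], "swingy": [1.0, 0.4, 1.0]}
summary = summarize(placeholder)
```

In the placeholder data the swinging series has the higher mean but also a much larger standard deviation, which is the trade-off the first bullet warns about.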
Reproduce it yourself
The most useful benchmark is your own. Take two or three of your hardest real tasks, score them with the rubric above, and compare models on your code. Our developer playbook and prompt kit have the workflows to run them.
Our take