Is Claude Opus 4.8 good at coding?

Anthropic positions it as their strongest Opus for coding, but we don't publish performance numbers we haven't measured ourselves. This page documents how we'll test it; the scores will be added after the run.

Why publish the method before the results?

Because a benchmark you can't inspect is just a vibe. Fixing the tasks and rubric up front prevents cherry-picking and makes the eventual results reproducible.

Will you share the raw outputs?

That's the plan — raw prompts, outputs and per-task scores, so you can check our work rather than trust a single headline number.

Testing Claude Opus 4.8 on Real Coding Tasks: Our Protocol

Most "I tested it!" posts lead with a big number and hide the method. We're doing the opposite: here's exactly how we'll evaluate Claude Opus 4.8 on real coding tasks. Results come after the run — not before.

No results yet. This article is the protocol only. We do not publish benchmark numbers we haven't measured, and we won't quote figures from official sources as if they were ours.

Why publish the protocol first

A benchmark is only as trustworthy as its method. Fixing the task set and scoring rubric before running anything removes the temptation to cherry-pick favorable examples, and lets you reproduce or challenge our results later.

The task set

Real engineering tasks, not puzzles. Each category targets a capability Opus 4.8 is supposed to have improved.

Category	Example task	What it stresses
Multi-file bug fix	Fix a failing test spanning 3+ files	Context + minimal diffs
Refactor	Restructure a module, preserve behavior	Discipline, no regressions
Feature add	Implement a small spec with tests	Correctness + coverage
Migration	Move call sites to a new API	Consistency at scale
Agentic	Multi-step task with tools	Reliability over a loop
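One way to make a frozen task set checkable is to publish a digest of the task manifest before the run, so anyone can later confirm the list did not change. This is a minimal sketch, not our actual tooling: the manifest layout, the task IDs, and the `manifest_digest` helper are all hypothetical.

```python
import hashlib
import json


def manifest_digest(manifest: dict) -> str:
    """Return a SHA-256 hex digest of a manifest, independent of key order."""
    # Canonical JSON: sorted keys and fixed separators give a stable byte string.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical manifest: categories follow the table above, task IDs are placeholders.
manifest = {
    "tasks": [
        {"id": "task-001", "category": "Multi-file bug fix"},
        {"id": "task-002", "category": "Refactor"},
        {"id": "task-003", "category": "Feature add"},
    ]
}

frozen = manifest_digest(manifest)
print(frozen)  # publish this digest alongside the protocol
```

Recomputing the digest after the run and comparing it to the published value shows whether any task was added, dropped, or edited.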

The scoring rubric

Every task is scored on the same axes, weighted toward what actually matters in production: does it work, and is it safe to ship?

Criterion	Weight	Pass bar
Correctness	40%	Tests pass; does what was asked
Safety / no regressions	25%	No broken behavior or new bugs
Diff quality	20%	Minimal, idiomatic, reviewable
Honesty	15%	Flags uncertainty / assumptions
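The weights above combine into a single per-task score as a weighted sum. The sketch below assumes each criterion is graded on a 0.0 to 1.0 scale, which this page does not specify; the key names and the `weighted_score` helper are illustrative only.

```python
# Weights copied from the rubric table; the 0.0-1.0 grading scale is an assumption.
RUBRIC_WEIGHTS = {
    "correctness": 0.40,
    "safety": 0.25,
    "diff_quality": 0.20,
    "honesty": 0.15,
}


def weighted_score(criterion_scores: dict) -> float:
    """Combine per-criterion scores (each in [0, 1]) into one weighted task score."""
    if set(criterion_scores) != set(RUBRIC_WEIGHTS):
        raise ValueError("scores must cover exactly the rubric criteria")
    for name, value in criterion_scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score {value} is outside [0, 1]")
    return sum(RUBRIC_WEIGHTS[name] * criterion_scores[name] for name in RUBRIC_WEIGHTS)


# Made-up grades for illustration: correct and safe, a noisy diff, no flagged assumptions.
example = {"correctness": 1.0, "safety": 1.0, "diff_quality": 0.5, "honesty": 0.0}
print(round(weighted_score(example), 2))  # 0.75
```

Rejecting missing or out-of-range criteria up front keeps a partially graded task from silently scoring as if the missing axis were zero.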

The method

Fig. 1: Protocol. How each run works

1. Freeze tasks & rubric (Pre-reg). Lock the task set and scoring weights (this page) before running anything.
2. Run repeated trials (Run). Run each task multiple times to capture variance, not a single lucky (or unlucky) attempt.
3. Score blind (Score). Strip model identity from outputs before scoring against the rubric to avoid bias.
4. Aggregate & publish raw (Publish). Report per-task scores and share raw prompts/outputs, not just a single headline number.

Tasks and rubric are frozen before the run; scoring is blind to which model produced which output.
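Blind scoring can be approximated by swapping each model name for an opaque ID and shuffling the order before anything reaches a scorer. The sketch below is one possible approach under stated assumptions: the record keys (`model`, `task`, `text`) and the `blind_outputs` helper are hypothetical, and it does not scrub cases where an output names its own model in the text.

```python
import random


def blind_outputs(outputs, seed=None):
    """Replace model names with opaque IDs and shuffle the order.

    `outputs` is a list of dicts with hypothetical keys "model", "task", "text".
    Returns (blinded, key): scorers see only `blinded`; `key` maps each opaque ID
    back to its model and stays sealed until scoring is finished.
    """
    rng = random.Random(seed)
    blinded, key = [], {}
    for item in outputs:
        blind_id = f"{rng.getrandbits(64):016x}"
        while blind_id in key:  # regenerate on the (very unlikely) collision
            blind_id = f"{rng.getrandbits(64):016x}"
        key[blind_id] = item["model"]
        blinded.append({"id": blind_id, "task": item["task"], "text": item["text"]})
    rng.shuffle(blinded)  # so position does not leak which model came first
    return blinded, key


# Placeholder records; "model-a" and "model-b" are stand-in names.
raw = [
    {"model": "model-a", "task": "task-001", "text": "diff A"},
    {"model": "model-b", "task": "task-001", "text": "diff B"},
]
blinded, key = blind_outputs(raw, seed=7)
```

Keeping `key` separate from `blinded` means scores can be joined back to models only after the rubric pass is complete.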

Fig. 2: Per task. Run → blind score → aggregate

Task → Run ×N → Blind score → Aggregate

Repeated trials per task feed a blind scoring pass, then we aggregate. Variance is reported, not hidden.
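Reporting variance alongside the average can be as simple as keeping the mean, sample standard deviation, and range per task. This is a minimal sketch using Python's standard `statistics` module; the `(task, score)` input shape and the `aggregate` helper are assumptions, and the scores are placeholders rather than results.

```python
from collections import defaultdict
from statistics import mean, stdev


def aggregate(trials):
    """Group (task, score) pairs by task and report mean, spread, and trial count."""
    by_task = defaultdict(list)
    for task, score in trials:
        by_task[task].append(score)
    report = {}
    for task, scores in by_task.items():
        report[task] = {
            "n": len(scores),
            "mean": mean(scores),
            # Sample standard deviation needs at least two trials.
            "stdev": stdev(scores) if len(scores) > 1 else None,
            "min": min(scores),
            "max": max(scores),
        }
    return report


# Placeholder scores for illustration only, not measured results.
trials = [("task-001", 0.5), ("task-001", 1.0), ("task-002", 0.75)]
report = aggregate(trials)
for task, stats in report.items():
    print(task, stats)
```

A task with a single trial reports `None` for the spread instead of a misleading zero.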

Results

Status: pending

Scores will appear here after we run the protocol above. Until then, there are intentionally no numbers on this page. Check back, or read the version comparison for qualitative guidance.

How to read the results (when they land)

Look at variance, not just the average. A high mean with wild swings is less useful than a steady, slightly lower one.
Weight by your own workload. If you never do migrations, that category shouldn't sway your decision.
Check the raw outputs. The diffs tell you more than the score.
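Weighting by your own workload amounts to a weighted average over categories, with your task mix as the weights. The sketch below is illustrative: the `workload_weighted_mean` helper, the category keys, and every number in the example are placeholders, not results from this protocol.

```python
def workload_weighted_mean(category_means, workload):
    """Average per-category scores using your own task mix as weights.

    `category_means` maps category -> mean score; `workload` maps category ->
    relative share of your work. Categories with zero share drop out.
    """
    total = sum(workload.get(cat, 0.0) for cat in category_means)
    if total <= 0:
        raise ValueError("workload must give positive weight to at least one category")
    weighted = sum(score * workload.get(cat, 0.0) for cat, score in category_means.items())
    return weighted / total


# Placeholder numbers: a team that does bug fixes and refactors but no migrations.
category_means = {"bug_fix": 0.8, "refactor": 0.6, "migration": 0.2}
workload = {"bug_fix": 3.0, "refactor": 1.0, "migration": 0.0}
print(round(workload_weighted_mean(category_means, workload), 2))  # 0.75
```

With a migration share of zero, the low migration score has no effect on the total, which is the point of the second tip above.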

Reproduce it yourself

The most useful benchmark is your own. Take two or three of your hardest real tasks, score them with the rubric above, and compare models on your code. Our developer playbook and prompt kit have the workflows to run them.

Our take

Honest beats impressive. We'd rather ship a credible protocol tonight and real numbers later than a big claim we can't back up.
