
I'm Testing Claude Opus 4.8 on Real Coding Tasks: The Protocol

Published May 28, 2026. Updated May 28, 2026.

Independent, unofficial guide — not affiliated with Anthropic. Verify all facts against official sources.

TL;DR

We're publishing our test protocol for Claude Opus 4.8 before any numbers — the tasks, the scoring rubric and the method — so the results are credible when they land. No fabricated scores: the results section stays empty until we've actually run it on real code.

Most "I tested it!" posts lead with a big number and hide the method. We're doing the opposite: here's exactly how we'll evaluate Claude Opus 4.8 on real coding tasks. Results come after the run — not before.

Verify this

No results yet. This article is the protocol only. We do not publish benchmark numbers we haven't measured, and we won't quote figures from official sources as if they were ours.

Why publish the protocol first

A benchmark is only as trustworthy as its method. Fixing the task set and scoring rubric before running anything removes the temptation to cherry-pick favorable examples, and lets you reproduce or challenge our results later.

The task set

Real engineering tasks, not puzzles. Each category targets a capability Opus 4.8 is supposed to have improved.

| Category           | Example task                             | What it stresses           |
|--------------------|------------------------------------------|----------------------------|
| Multi-file bug fix | Fix a failing test spanning 3+ files     | Context + minimal diffs    |
| Refactor           | Restructure a module, preserve behavior  | Discipline, no regressions |
| Feature add        | Implement a small spec with tests        | Correctness + coverage     |
| Migration          | Move call sites to a new API             | Consistency at scale       |
| Agentic            | Multi-step task with tools               | Reliability over a loop    |
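
One way to make "frozen before the run" checkable is to record a fingerprint of the task set and publish it alongside the protocol. The sketch below is only an illustration under our own assumptions: the `Task` class, the `fingerprint` helper and the choice of SHA-256 over canonical JSON are hypothetical and not part of the published protocol. The task entries are copied from the table above.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Task:
    """One row of the task set (hypothetical structure, for illustration)."""
    category: str
    example: str
    stresses: str


TASK_SET = (
    Task("Multi-file bug fix", "Fix a failing test spanning 3+ files", "Context + minimal diffs"),
    Task("Refactor", "Restructure a module, preserve behavior", "Discipline, no regressions"),
    Task("Feature add", "Implement a small spec with tests", "Correctness + coverage"),
    Task("Migration", "Move call sites to a new API", "Consistency at scale"),
    Task("Agentic", "Multi-step task with tools", "Reliability over a loop"),
)


def fingerprint(tasks) -> str:
    """Return a SHA-256 hex digest of the task set in a canonical JSON form."""
    canonical = json.dumps(
        [asdict(t) for t in tasks],
        sort_keys=True,            # stable key order so the digest is reproducible
        separators=(",", ":"),     # no incidental whitespace
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Record this value before the run; any later edit to the task set changes it.
FROZEN = fingerprint(TASK_SET)
```

Anyone holding the same task definitions can recompute the digest and compare it with the recorded one. A mismatch means the task set changed after it was frozen.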

The scoring rubric

Every task is scored on the same axes, weighted toward what actually matters in production: does it work, and is it safe to ship?

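| Criterion               | Weight | Pass bar                        |
|-------------------------|--------|---------------------------------|
| Correctness             | 40%    | Tests pass; does what was asked |
| Safety / no regressions | 25%    | No broken behavior or new bugs  |
| Diff quality            | 20%    | Minimal, idiomatic, reviewable  |
| Honesty                 | 15%    | Flags uncertainty / assumptions |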
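
The weights can be combined into one per-task score as a plain weighted sum. The sketch below assumes each criterion is graded on a 0 to 1 scale, which the rubric itself does not specify. The dictionary keys and the `weighted_score` function are hypothetical names chosen for illustration.

```python
import math

# Weights from the rubric table; the key names are our own shorthand.
RUBRIC_WEIGHTS = {
    "correctness": 0.40,
    "safety": 0.25,
    "diff_quality": 0.20,
    "honesty": 0.15,
}

# Guard against a rubric edit that silently breaks the 100% total.
assert math.isclose(sum(RUBRIC_WEIGHTS.values()), 1.0)


def weighted_score(criterion_scores: dict[str, float]) -> float:
    """Combine per-criterion scores (assumed to be in [0, 1]) into one score."""
    if set(criterion_scores) != set(RUBRIC_WEIGHTS):
        raise ValueError("scores must cover exactly the rubric criteria")
    for name, score in criterion_scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} score {score} is outside [0, 1]")
    return sum(RUBRIC_WEIGHTS[name] * score for name, score in criterion_scores.items())
```

Under this assumption, an output that passes correctness and safety in full, earns half marks on diff quality and flags no assumptions at all would score about 0.75. That is an illustrative input, not a measured result.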

The method

Fig. 1. Protocol: how each run works

  1. Freeze tasks & rubric (Pre-reg). Lock the task set and scoring weights (this page) before running anything.
  2. Run repeated trials (Run). Run each task multiple times to capture variance, not a single lucky (or unlucky) attempt.
  3. Score blind (Score). Strip model identity from outputs before scoring against the rubric to avoid bias.
  4. Aggregate & publish raw (Publish). Report per-task scores and share raw prompts/outputs, not just a single headline number.

Tasks and rubric are frozen before the run; scoring is blind to which model produced which output.

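
The blind-scoring step can be sketched as a small function that removes the model label, shuffles the outputs and keeps a separate key for un-blinding after scoring. This is a minimal sketch under our own assumptions: the record fields (`model`, `task`, `output`), the `sample-NNN` IDs and the `blind` function are hypothetical. It only removes the label field and does not scrub text inside an output that might reveal which model wrote it.

```python
import random


def blind(outputs: list[dict], seed: int = 0) -> tuple[list[dict], dict[str, str]]:
    """Return (blinded records, key mapping blind ID -> model name).

    Each input record is assumed to have 'model', 'task' and 'output' fields.
    """
    rng = random.Random(seed)   # seeded so the shuffle can be reproduced and audited
    shuffled = list(outputs)    # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)

    blinded, key = [], {}
    for i, record in enumerate(shuffled):
        blind_id = f"sample-{i:03d}"
        # The scorer sees only the ID, the task and the output.
        blinded.append({"id": blind_id, "task": record["task"], "output": record["output"]})
        # The key is stored separately and opened only after scoring is done.
        key[blind_id] = record["model"]
    return blinded, key
```

Keeping the key in a separate file that the scorer cannot read is what makes the pass blind. The shuffle matters too, because a fixed ordering would leak which model produced which output.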
Fig. 2. Per task: run → blind score → aggregate

Task → Run ×N → Blind score → Aggregate

Repeated trials per task feed a blind scoring pass, then we aggregate. Variance is reported, not hidden.
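
The aggregation step could look like the sketch below, which summarizes the repeated-trial scores for each task with a mean, a sample standard deviation and the observed range. The `aggregate` function and its input shape are assumptions for illustration. The protocol does not fix the number of trials or the exact statistics reported.

```python
from statistics import mean, stdev


def aggregate(trial_scores: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """Summarize per-task trial scores so that spread is reported alongside the mean."""
    summary = {}
    for task, scores in trial_scores.items():
        if len(scores) < 2:
            # A single trial cannot show variance, which is the point of repeating.
            raise ValueError(f"{task}: need at least two trials, got {len(scores)}")
        summary[task] = {
            "n": len(scores),
            "mean": mean(scores),
            "stdev": stdev(scores),   # sample standard deviation
            "min": min(scores),
            "max": max(scores),
        }
    return summary
```

Reporting `min` and `max` next to the mean keeps a single failed trial visible even when the average looks healthy.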

Results

Status: pending

Scores will appear here after we run the protocol above. Until then, there are intentionally no numbers on this page. Check back, or read the version comparison for qualitative guidance.

How to read the results (when they land)

  • Look at variance, not just the average. A high mean with wild swings is less useful than a steady, slightly-lower one.
  • Weight by your own workload. If you never do migrations, that category shouldn't sway your decision.
  • Check the raw outputs. The diffs tell you more than the score.
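
Weighting by your own workload can be done with a normalized weighted average over category means. The sketch below is an assumption about how a reader might do this, not part of the protocol. The `workload_weighted` function is hypothetical, and the workload shares are whatever you estimate for your own team.

```python
def workload_weighted(category_means: dict[str, float], workload: dict[str, float]) -> float:
    """Average category means using your own workload shares as weights.

    Categories left out of `workload` (or given weight 0) do not affect the result.
    """
    missing = set(workload) - set(category_means)
    if missing:
        raise KeyError(f"no score for categories: {sorted(missing)}")
    total = sum(workload.values())
    if total <= 0:
        raise ValueError("workload weights must sum to a positive number")
    # Dividing by the total means the weights do not need to sum to 1.
    return sum(category_means[c] * w for c, w in workload.items()) / total
```

A team that never does migrations would leave "Migration" out of `workload`, so a strong or weak score in that category would not move their number.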

Reproduce it yourself

The most useful benchmark is your own. Take two or three of your hardest real tasks, score them with the rubric above, and compare models on your code. Our developer playbook and prompt kit have the workflows to run them.

Our take

Honest beats impressive. We'd rather ship a credible protocol tonight and real numbers later than a big claim we can't back up.
