Four Weeks Ago I Called GPT-5.5 the Best Coding AI. Opus 4.8 Found What It Missed.
On day one, I pointed Opus 4.8 at a project GPT-5.5 had been building and asked four open questions. It read the full codebase, ran every test, checked its own claims, and found the gaps that mattered most.
Four weeks ago I published a workflow post. The headline was direct: “GPT-5.5 on Codex is the best coding AI available right now. It is not close.” I split my daily work into three lanes: Claude for thinking and iteration, GPT-5.5 on Codex for building, and Claude for design. I also said something that matters now: this is a snapshot, not a permanent recommendation. The tools move fast. Your workflow should too.
Anthropic shipped Opus 4.8 today. This is the revisit I said was coming.
I covered the benchmarks, pricing, and new features in a separate post earlier today. [1] The short version: Anthropic calls it “a modest but tangible improvement,” the coding score moved from 64.3 to 69.2 percent, a gain of 4.9 points, and the effort controls and cheaper fast mode are the changes that matter most for builders managing costs at scale. [2][3] That post covers what shipped. This one covers what the model did when I pointed it at real work.
What I tested
I had an app in active development that GPT-5.5 Pro had been building on Codex. It was functional, with a passing test suite, but not finished. Not ready for users. I had been iterating on it for a while, and the iteration had stalled. There were gaps I knew about, and I had asked GPT-5.5 to identify and address them. It hadn’t done so correctly. That frustration was part of why development had paused.
I pointed Opus 4.8 at the project in Claude Code. The prompt was deliberately minimal: review this app, what’s missing, what can improve, can you do something better? Four open questions. No context about what I wanted fixed. No hints about where the problems were. I wanted to see what the model would find on its own.
What it did
It did not start with opinions. It read the full codebase in structured passes, file by file, backend through frontend. It ran the entire test suite, confirmed every test passed, and then searched the codebase for specific terms to verify the claims it was forming. Only after all of that did it write a single line of review.
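The search-then-verify step can be sketched in a few lines of Python. This is a minimal illustration and not a record of what the model actually ran: the helper names `find_term` and `check_claims`, the file suffixes, and the claim format are all assumptions made for the example. The idea is simply that a claim such as “there is no rate limiting” is only kept if a search of the codebase agrees with it.

```python
from pathlib import Path


def find_term(root, term, suffixes=(".py", ".js", ".ts", ".md")):
    """Case-insensitive search for `term` in source files under `root`.

    Hypothetical helper: returns (relative_path, line_number, line) tuples.
    """
    root = Path(root)
    needle = term.lower()
    hits = []
    for path in sorted(root.rglob("*")):
        # Skip directories and file types we did not ask for.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                hits.append((path.relative_to(root).as_posix(), number, line.strip()))
    return hits


def check_claims(root, claims):
    """Check each claim against the codebase before it goes into a review.

    `claims` is a list of (claim, term, expect_present) tuples (assumed format).
    A claim is marked supported when the term is found if it was expected,
    or absent if it was expected to be missing.
    """
    report = []
    for claim, term, expect_present in claims:
        hits = find_term(root, term)
        supported = bool(hits) == expect_present
        report.append(
            {"claim": claim, "term": term, "hits": hits, "supported": supported}
        )
    return report
```

Calling `check_claims(".", [("No rate limiting anywhere", "rate_limit", False)])` from a project root would return one report entry whose `supported` flag is true only if the term never appears in the scanned files. A plain substring search is a crude check, so treat a result as a prompt to read the matching lines, not as proof.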
The review separated what was genuinely well-built from what only looked finished. It identified two structural gaps, both in areas that would directly affect whether a user trusts the product. Not style preferences. Not edge cases. The kind of findings that explain why the iteration had stalled.
It ranked every gap by how much user-facing credibility each one would recover. It named which ones were cosmetic and which ones were structural. And when it got to the last question, whether it could do something better, it did not promise a rewrite. It said the foundation was solid, the architecture didn’t need replacing, and the real value was in finishing what was already there. It offered three specific next steps in priority order and asked whether I wanted it to start with the top two.
A model talking itself out of more work, because more work was the wrong call, is not the behavior you expect on day one of a release.
Why this changes my workflow
I had asked GPT-5.5 Pro to find these same problems. It hadn’t. That is the detail that separates this from a reviewer catching expected gaps in an unfinished product. Both models were given the chance to identify what was wrong. One did. One didn’t. And the one that did was working from a vaguer, more open-ended prompt with less hand-holding.
This lines up with what Anthropic leads with in the launch post. The company says Opus 4.8 is about four times less likely than 4.7 to let flaws in its own code pass unremarked. [2] Those are internal numbers, so apply the discount you would to any vendor’s self-reported results. One of the testers quoted in the launch notes, a senior investment associate, describes the same pattern from a different domain: a model that proactively flags issues other tools routinely miss and leave for the user to catch. [2] Different field, same behavior. What I saw in one session matches what Anthropic claims at the aggregate level. That is a signal worth noting, not a verdict.
What it doesn’t settle
This was one project, one review, one session, on day one. The app was still in active development, and further iteration with GPT-5.5 might have surfaced these gaps eventually. A model finding unfinished work in an unfinished product is less remarkable on its own. What’s remarkable is the model correctly prioritizing which unfinished work mattered most, verifying its own claims before committing to them, and declining the ambitious move in favor of the right one. Whether it does that reliably across many sessions is a question I can’t answer yet.
The standing caveat from four weeks ago still applies. It could all move next week. OpenAI could push GPT-5.5 forward. Anthropic could ship again before this release has cooled. Snapshots are snapshots.
But the workflow I published a month ago is up for revision. GPT-5.5 on Codex is still a serious build tool, and one session does not change that. What it changes is the assumption that the question was settled. I said the tools move fast and your workflow should too. I am taking my own advice.
Anthropic used the word “modest” in the launch post. [2] On the benchmark table, that is fair. On the judgment, the verification, and the willingness to say “the right answer here is less work, not more,” it is not the word I would use. It’s much more than modest. I’m impressed.
References
[1] nwslyr, “Claude Opus 4.8: What Changed, What It Costs, and What It Means for Builders,” May 28, 2026. https://www.nwslyr.com/blog/claude-opus-4-8-what-changed-costs-builders/
[2] Anthropic, “Introducing Claude Opus 4.8,” May 28, 2026. https://www.anthropic.com/news/claude-opus-4-8
[3] 9to5Mac, “Anthropic upgrades Claude with new Opus 4.8 model, here’s what’s new,” May 28, 2026. https://9to5mac.com/2026/05/28/anthropic-upgrades-claude-with-new-opus-4-8-model-heres-whats-new/