← Back to home

Published February 23, 2026

SWE-bench Isn't a Model Leaderboard. It's a Discipline Leaderboard. cover image

SWE-bench Isn’t a Model Leaderboard. It’s a Discipline Leaderboard.

People argue about which model is “best” at coding.

That argument is already outdated.

Because SWE-bench Verified doesn’t reward confidence. It rewards a system that can repeatedly do one thing:

turn a failing test into a passing test — with receipts.

If your agent can write a patch but can’t prove it works…

it’s not a coding agent.

It’s writing fanfiction about code.


No Verifier = Patch Fanfic. Here’s the Loop That Ships

The uncomfortable truth

Most “AI coding demos” are:

That’s not engineering.

That’s marketing.

Production doesn’t care how smart your patch looks. CI only cares about one thing:

does it pass?

So your agent needs a discipline loop, not a clever prompt.


🐉 The Patch Reliability Dragon

This dragon shows up at 2:17am.

Symptom: “Looks right” code that breaks prod. Cause: patching without gates. Cure: RED → GREEN + a skeptic that’s allowed to say “FAIL.”


The loop that wins (Stillwater-style)

Here’s the workflow I use. It’s boring — which is why it works:

1) Pin the target

2) Make the smallest diff

3) Run the verifier

4) If it fails: patch again or stop cleanly

No wandering. No vibes. No “should work.”


The new leaderboard that matters

Stop ranking “model IQ.” Rank what actually ships.

Leaderboard:

  1. $ / verified patch
  2. time-to-green
  3. rerun survival rate
  4. regression rate (lower is better)

This is how you compare systems honestly.


The receipts (what “proof” looks like)

A real coding agent output should include:

If you don’t have those, you don’t have proof.

You have a story.


MrBeast-style challenge (participation loop)

Let’s make this concrete.

Challenge: Post your RED.

I’ll reply with:

Comment RED and I’ll give you a copy/paste “RED→GREEN Gate” template you can drop into any coding agent prompt.


Tower placement (why this is Floor 4)

In the Stillwater Tower, this is Floor 4: Precision.

You’re not trying to be clever.

You’re trying to be reliable.

Because reliability compounds. And fanfiction doesn’t.


The point (one line)

SWE-bench is not a model contest. It’s a process contest.

The winner is the system that:

— Phuc Vinh Truong