Published February 23, 2026

Game of Death: AI Tower Challenge

“When AGI?” Wrong question. Here are the 10 ways your agent fails today.

Everyone keeps asking:

“When AGI?”

Wrong question.

The question that matters is: What are the 10 ways my agent fails today — and what gates turn those faceplants into proof?

Because in the real world, “smart” isn’t enough. Discipline beats talent. Receipts beat vibes.

So I built a dojo.


🏯 STILLWATER: GAME OF DEATH — The AGI Tower Challenge

Inspired by Bruce Lee’s unfinished masterpiece: 5 floors. 10 dragons. One technique mastered.

“I fear not the man who practices 10,000 techniques once, but the man who practices one technique 10,000 times.”

In AI, that one technique is verification.

Right now, most agents are clever autocomplete: fast, impressive… fragile. Stillwater trains agents to become something rarer:

a martial artist — reliable, bounded, and auditable.


🎮 The Challenge (you can do this today)

You don’t need to “believe” my claims.

Run this. Earn your belt. Show your stdout.

⚪ White Belt (60 seconds): Proof of Life

If you can’t run the system, nothing else matters.

python -m pip install -e ".[dev]"
PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest -q
python -m nbconvert --execute --to notebook --inplace PHUC-ORCHESTRATION-SECRET-SAUCE.ipynb

✅ If that runs clean, you just did something rare: you produced an artifact a skeptic can replay.

Comment “WHITE BELT” + your stdout summary and I’ll tell you your next dragon.


🗼 The Tower (5 Floors, 10 Dragons)

This isn’t philosophy. It’s a failure-mode map.

Each dragon is a way AI fails in production right now.

FLOOR 1 — HONESTY

🐉 Dragon #1: Hallucination
Symptom: Eloquence without evidence.
Gate: Lane Algebra (typed claims; “UNKNOWN” allowed).
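To make "typed claims; UNKNOWN allowed" concrete, here is a minimal sketch of the idea. It is not Stillwater's Lane Algebra: the lane names (`VERIFIED`, `INFERRED`, `UNKNOWN`), the `Claim` class, and the `answer` helper are all hypothetical, chosen only to show a claim type that refuses to call itself verified without evidence.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Lane(Enum):
    # Hypothetical labels; the post does not list the actual lanes.
    VERIFIED = "verified"  # backed by a checkable artifact
    INFERRED = "inferred"  # derived, but not independently checked
    UNKNOWN = "unknown"    # explicitly allowed: no evidence, no answer


@dataclass(frozen=True)
class Claim:
    text: str
    lane: Lane
    evidence: Optional[str] = None  # e.g. a file path or a test ID

    def __post_init__(self) -> None:
        # A claim may not be labeled VERIFIED without evidence attached.
        if self.lane is Lane.VERIFIED and not self.evidence:
            raise ValueError("VERIFIED claim requires evidence")


def answer(text: str, evidence: Optional[str]) -> Claim:
    """Downgrade to UNKNOWN rather than assert without evidence."""
    if evidence:
        return Claim(text, Lane.VERIFIED, evidence)
    return Claim(text, Lane.UNKNOWN)
```

The design choice worth noticing: "I don't know" is a first-class value, so the cheapest path for the agent is honesty rather than eloquence.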

FLOOR 2 — FOUNDATION

🐉 Dragon #2: Counting / Aggregation
Symptom: “Close enough” is wrong.
Gate: Counter Bypass (CPU enumerates; LLM classifies only if needed).

🐉 Dragon #3: Context / Memory Rot
Symptom: Old narrative becomes “truth.”
Gate: Context Normal Form (artifacts persist; narrative dies).
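The Counter Bypass split can be sketched in a few lines: ordinary code does the counting, and a classifier (here a toy stand-in for a model call) only assigns labels. The function names `count_by_label` and `toy_classify` are illustrative assumptions, not names from the Stillwater repo.

```python
from collections import Counter
from typing import Callable, Iterable


def count_by_label(items: Iterable[str], classify: Callable[[str], str]) -> Counter:
    """Label each item, then count deterministically in code."""
    return Counter(classify(item) for item in items)


def toy_classify(line: str) -> str:
    # Stand-in classifier; in practice this is where a model call might go.
    return "error" if "ERROR" in line else "other"


logs = ["ERROR disk full", "ok", "ERROR timeout", "ok", "ok"]
counts = count_by_label(logs, toy_classify)
print(counts["error"], counts["other"])  # 2 3
```

The model never sees the question "how many?", so it never gets the chance to answer "about two."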

FLOOR 3 — PROVING

🐉 Dragon #4: Reasoning Theater
Symptom: Persuasive stories you can’t audit.
Gate: Witness-first reasoning (intermediates + falsifiers).

🐉 Dragon #5: Verification Vibes
Symptom: Confidence masquerading as proof.
Gate: Verification Ladder (pick a rung; emit receipts).

FLOOR 4 — PRECISION

🐉 Dragon #6: Patch Fanfic
Symptom: “Looks right” code that breaks prod.
Gate: RED → GREEN (no patch without a failing test first).

🐉 Dragon #7: Generalization Faceplant
Symptom: Works once, then collapses.
Gate: Replay stability (seed sweep + replay checks).

🐉 Dragon #8: Data Exhaustion
Symptom: More text, less progress.
Gate: Recipes (replayable units of progress).
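One way to read "seed sweep + replay checks" is: for each seed, run the task several times and require identical output. The sketch below assumes exactly that reading; `fingerprint`, `replay_stable`, and the two example tasks are hypothetical names, and comparing `repr` hashes is a simplification that only suits outputs with a stable `repr`.

```python
import hashlib
import random
from typing import Callable, Dict, Iterable


def fingerprint(output: object) -> str:
    """Hash the repr of an output so runs can be compared cheaply."""
    return hashlib.sha256(repr(output).encode()).hexdigest()


def replay_stable(task: Callable[[int], object], seeds: Iterable[int], replays: int = 3) -> Dict[int, bool]:
    """For each seed, replay the task and report whether every run matched."""
    report = {}
    for seed in seeds:
        prints = {fingerprint(task(seed)) for _ in range(replays)}
        report[seed] = len(prints) == 1
    return report


def seeded_task(seed: int) -> list:
    rng = random.Random(seed)  # local RNG: output depends only on the seed
    return [rng.randint(0, 9) for _ in range(5)]


def leaky_task(seed: int) -> list:
    return [random.random()]  # ignores the seed and reads global RNG state
```

Here `replay_stable(seeded_task, range(5))` reports every seed as stable, while `leaky_task` fails the same check because its output depends on state outside the seed.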

FLOOR 5 — MASTERY

🐉 Dragon #9: Alignment (Operational)
Symptom: Tool-use goes off the rails.
Gate: Fail-closed envelope (network OFF by default; bounded IO).

🐉 Dragon #10: Security
Symptom: Injection + cost explosions + unsafe ops.
Gate: Untrusted input firewall + allowlists + evidence gates.
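A fail-closed envelope with an allowlist might look roughly like this. It is a sketch under assumptions: the `Envelope` class, `ToolDenied`, and the `needs_network` flag are invented for illustration, and the network check is a declared flag rather than real sandboxing.

```python
from typing import Any, Callable, FrozenSet


class ToolDenied(Exception):
    """Raised when a tool call falls outside the envelope."""


class Envelope:
    """Hypothetical fail-closed wrapper: anything not explicitly allowed is denied."""

    def __init__(self, allowed_tools: FrozenSet[str] = frozenset(), network: bool = False):
        self.allowed_tools = allowed_tools  # empty by default: nothing runs
        self.network = network              # OFF by default

    def call(self, name: str, tool: Callable[..., Any], *args: Any, needs_network: bool = False) -> Any:
        if name not in self.allowed_tools:
            raise ToolDenied(f"tool not on allowlist: {name}")
        if needs_network and not self.network:
            raise ToolDenied(f"network is off; refusing: {name}")
        return tool(*args)


env = Envelope(allowed_tools=frozenset({"word_count"}))
print(env.call("word_count", lambda s: len(s.split()), "receipts beat vibes"))  # 3
```

The defaults carry the policy: a freshly constructed `Envelope()` permits nothing, so every capability has to be granted on purpose.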


🥋 The Stillwater Vows (why this exists)

Stillwater isn’t trying to summon omniscience.

It’s trying to do something more useful:

If AI is going to act (code, infra, data, decisions), then capability without discipline becomes a liability.


🏅 The Belt System (community mechanics)

This is where the MrBeast energy comes in — not hype, participation.

⚪ White Belt: system runs, tests exit 0
🟡 Yellow Belt: you beat 1 dragon
🟢 Green Belt: you read + embody protocols
🟤 Brown Belt: you face all 10 dragons
⚫ Black Belt: ongoing practice — you ship with receipts

Leaderboard that matters (not model worship):

  1. $/verified patch
  2. time-to-green
  3. rerun survival rate

If your demo can’t survive rerun… it’s a magic trick.
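The post names the three leaderboard metrics but does not define them, so the arithmetic below is an assumption: dollars per verified patch is total spend divided by verified patches, time-to-green is the median over runs that reached green, and rerun survival is the fraction of verified patches that still pass on rerun. The `Run` fields and the sample numbers are made up for illustration.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class Run:
    cost_usd: float
    seconds_to_green: Optional[float]  # None if tests never went green
    verified: bool
    survived_rerun: bool


def leaderboard(runs: List[Run]) -> dict:
    verified = [r for r in runs if r.verified]
    greens = [r.seconds_to_green for r in runs if r.seconds_to_green is not None]
    return {
        # Failed attempts still cost money, so all spend counts in the numerator.
        "usd_per_verified_patch": sum(r.cost_usd for r in runs) / len(verified) if verified else None,
        "median_time_to_green_s": median(greens) if greens else None,
        "rerun_survival_rate": sum(r.survived_rerun for r in verified) / len(verified) if verified else None,
    }


runs = [
    Run(2.0, 120.0, True, True),
    Run(1.0, None, False, False),
    Run(3.0, 300.0, True, False),
]
print(leaderboard(runs))
# {'usd_per_verified_patch': 3.0, 'median_time_to_green_s': 210.0, 'rerun_survival_rate': 0.5}
```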


The point (in one line)

AGI isn’t a moment. It’s coverage.

Coverage of failure modes. Coverage enforced by gates. Coverage proven with receipts.


Your invitation

If you build agents, you can join the challenge.

And if you want the full Tower doc / dojo path, it lives in the repo (skills, papers, notebooks, receipts).

Endure. Excel. Evolve. Carpe Diem. — Phuc Vinh Truong