
Published February 23, 2026


A child can count better than Frontier AI Models

If Your “Frontier” Model Can’t Count, You Don’t Have Reasoning. You Have Vibes.

Here’s a truth that makes people mad:

A 128K context window doesn’t mean your model can count.

It means it can read more while still doing approximate attention math.

So when you ask an LLM to aggregate a long document—count categories, sum totals, tally events—you’re not testing “reasoning.”

You’re testing whether it can do exact arithmetic using a tool that isn’t exact.

That’s why OOLONG hurts.

Here’s the cheat code: don’t let the LLM count.


Stop Asking LLMs to Aggregate. Make the CPU Count.

🐉 The Counting Dragon

This dragon shows up everywhere:

A pure LLM answer often feels right… until you verify it.

And in production, “close enough” is wrong.


The principle (one line)

Use the LLM for classification. Use deterministic code for aggregation.

That’s it.

That’s the whole trick.


The Counter-Bypass Protocol (Stillwater style)

This is the protocol I use to beat long-context aggregation traps:

1) Parse → structured rows (deterministic)

Turn the big blob into rows/fields.

2) (Optional) LLM labels each row

Only if the classification requires fuzzy semantics.

3) CPU aggregates (required)

Use Counter(), sum(), exact integers/fractions.

4) Verify with executable markers

Emit counts plus a reproducible audit trail, so anyone can re-run the aggregation and get the same numbers.

LLM is the interpreter. CPU is the accountant. Never swap those.
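The four steps above can be sketched in a few lines of Python. This is a minimal, self-contained version: `classify` stands in for whatever LLM call you'd make in practice, replaced here by a trivial keyword rule so the sketch runs without a model. The function names and keywords are illustrative assumptions, not a fixed API.

```python
from collections import Counter

def parse(blob: str) -> list[str]:
    """Step 1: deterministic parse -- big blob -> rows."""
    return [line.strip() for line in blob.splitlines() if line.strip()]

def classify(row: str) -> str:
    """Step 2 (optional): label each row.
    Swap in an LLM call only if the labels need fuzzy semantics;
    this keyword rule is a stand-in so the example is runnable."""
    if "charge" in row or "invoice" in row:
        return "billing"
    if "crash" in row or "error" in row:
        return "bug"
    return "feature_request"

def aggregate(rows: list[str]) -> Counter:
    """Step 3: CPU aggregates -- exact integers, no approximation."""
    return Counter(classify(r) for r in rows)

blob = """\
Double charge on my invoice
App crash on login
Please add dark mode
Refund the extra charge
"""
rows = parse(blob)
counts = aggregate(rows)

# Step 4: verify with an executable marker -- totals must reconcile.
assert sum(counts.values()) == len(rows)
print(dict(counts))
```

The key design choice: the LLM (or its stand-in) only ever sees one row at a time and emits one label. Every number in the output comes from `Counter`, not from attention.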


A simple example (that scales)

Let’s say you have 5,000 support tickets and you ask:

“How many are billing vs bugs vs feature requests?”

Bad approach: paste all 5,000 tickets into the context window and ask the model for totals.

Disciplined approach: parse the tickets into rows, let the LLM label each row, and let deterministic code produce the tally.

Now your result isn’t “a vibe.”

It’s a report you can audit.
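What makes the result auditable is that every count traces back to specific ticket IDs. A hedged sketch of that report, with the labels hard-coded (in practice they'd come from the classification pass) and hypothetical ticket IDs:

```python
from collections import Counter, defaultdict

# (ticket_id, label) pairs -- assumed output of an earlier labeling pass.
labeled = [
    ("T-001", "billing"),
    ("T-002", "bug"),
    ("T-003", "billing"),
    ("T-004", "feature_request"),
]

counts = Counter(label for _, label in labeled)
by_label = defaultdict(list)
for ticket_id, label in labeled:
    by_label[label].append(ticket_id)

# Reconciliation check: the report must account for every ticket.
assert sum(counts.values()) == len(labeled)

for label, n in sorted(counts.items()):
    print(f"{label}: {n} ({', '.join(by_label[label])})")
```

Anyone who doubts a number can follow the IDs back to the source rows and recount by hand.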


Why attention fails at counting (plain-English)

LLMs don’t “loop” over tokens the way your code loops over rows.

They compress patterns, and compression is approximate.

Counting is not pattern-completion.

Counting is enumeration.

So the fix is not “think harder.”

The fix is: use the right tool.
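"Enumeration" has a concrete meaning here: a loop visits every item exactly once, so the answer is exact whether there are fifty items or five million. A minimal illustration:

```python
def exact_count(items, predicate):
    """Count by enumeration: one visit per item, no approximation."""
    total = 0
    for item in items:  # an explicit loop, unlike attention's soft weighting
        if predicate(item):
            total += 1
    return total

events = ["error"] * 1234 + ["ok"] * 8766
assert exact_count(events, lambda e: e == "error") == 1234
```

The loop's cost grows with input size, but its accuracy doesn't degrade, which is exactly the property a bigger context window doesn't buy you.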


The MrBeast-style challenge (viral loop)

Let’s do a public sparring test.

Challenge: paste a long chunk of text (or a dataset summary) and ask for an exact count or aggregation.

I’ll reply with the deterministic count plus the audit trail that proves it.

Comment: COUNT and I’ll drop the “Counter-Bypass Card” you can paste into your agent prompt.


Bonus: the Tower framing (makes it memorable)

This is Floor 2: Foundation in the Stillwater Tower.

If your agent can’t beat the Counting Dragon, it shouldn’t be touching domains that punish “almost.”


The point (one line)

Long context isn’t memory. It’s a bigger whiteboard. If you want correctness: determinism + receipts.


Endure. Excel. Evolve. — Phuc Vinh Truong