A Child Can Count Better Than Frontier AI Models

If Your “Frontier” Model Can’t Count, You Don’t Have Reasoning. You Have Vibes.

Here’s a truth that makes people mad:

A 128K context window doesn’t mean your model can count.

It means the model can read more while still doing approximate attention math.

So when you ask an LLM to aggregate a long document (count categories, sum totals, tally events), you’re not testing “reasoning.”

You’re testing whether it can do exact arithmetic using a tool that isn’t exact.

That’s why OOLONG, a benchmark for long-context aggregation, hurts.

Here’s the cheat code: don’t let the LLM count.

Stop Asking LLMs to Aggregate. Make the CPU Count.

🐉 The Counting Dragon

This dragon shows up everywhere:

“How many customers mentioned churn risk?”
“How many times does the term X appear?”
“What’s the total revenue by segment?”
“Count the records that match criterion Y.”

A pure LLM answer often feels right… until you verify it.

And in production, “close enough” is wrong.

The principle (one line)

Use the LLM for classification. Use deterministic code for aggregation.

That’s it.

That’s the whole trick.
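
Two of the dragon’s questions above need no model at all. Here is a minimal sketch, assuming the data is already plain text or (segment, amount) pairs. The function names count_term and revenue_by_segment, and the sample data, are made up for illustration. Amounts are kept as Decimal so cents never drift.

```python
import re
from collections import Counter
from decimal import Decimal

def count_term(text: str, term: str) -> int:
    """Exact, case-insensitive whole-word count of `term` in `text`.
    Assumes `term` starts and ends with a word character."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

def revenue_by_segment(rows):
    """Sum amounts per segment using Decimal, not float."""
    totals = {}
    for segment, amount in rows:
        totals[segment] = totals.get(segment, Decimal("0")) + Decimal(amount)
    return totals

def count_by_segment(rows):
    """How many records fall in each segment."""
    return Counter(segment for segment, _ in rows)

# Illustrative data only.
text = "Churn risk is rising. We flagged churn twice; the churn model agrees."
rows = [("smb", "19.99"), ("enterprise", "1200.00"), ("smb", "0.01")]

print(count_term(text, "churn"))
print(revenue_by_segment(rows))
print(count_by_segment(rows))
```

No prompt, no sampling, no “roughly.” Same input, same numbers, every time.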

The Counter-Bypass Protocol (Stillwater style)

This is the protocol I use to beat long-context aggregation traps:

1) Parse → structured rows (deterministic)

Turn the big blob into rows/fields.

2) (Optional) LLM labels each row

Only if the classification requires fuzzy semantics.

3) CPU aggregates (required)

Use Python’s Counter() and sum(), with exact integers or fractions.

4) Verify with executable markers

Emit counts + a reproducible audit trail:

sample rows that match each bucket
hashes of inputs/outputs
“if you rerun, you get the same result”

LLM is the interpreter. CPU is the accountant. Never swap those.
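
The four steps can be sketched roughly like this. It is an illustration under assumptions, not the definitive protocol: the “id|text” line format, the function names, and the report layout are all invented here, and label_row is a keyword stand-in for whatever per-row LLM call you would actually make.

```python
import hashlib
import json
from collections import Counter, defaultdict

def parse_rows(blob: str) -> list[dict]:
    """Step 1: deterministic parse. Assumes one 'id|text' record per line."""
    rows = []
    for line in blob.strip().splitlines():
        record_id, _, text = line.partition("|")
        rows.append({"id": record_id.strip(), "text": text.strip()})
    return rows

def label_row(row: dict) -> str:
    """Step 2 (optional): hypothetical stand-in for an LLM call on ONE row.
    A keyword rule is used so the sketch runs offline."""
    return "churn_risk" if "cancel" in row["text"].lower() else "other"

def aggregate(labels) -> Counter:
    """Step 3: the CPU counts. No model involved."""
    return Counter(labels)

def receipts(blob, rows, labels, counts, samples_per_bucket=2) -> dict:
    """Step 4: counts plus an audit trail (sample row ids, input/output hashes)."""
    samples = defaultdict(list)
    for row, label in zip(rows, labels):
        if len(samples[label]) < samples_per_bucket:
            samples[label].append(row["id"])
    output = json.dumps(counts, sort_keys=True)
    return {
        "input_sha256": hashlib.sha256(blob.encode("utf-8")).hexdigest(),
        "output_sha256": hashlib.sha256(output.encode("utf-8")).hexdigest(),
        "counts": dict(counts),
        "samples": dict(samples),
    }

def run(blob: str) -> dict:
    rows = parse_rows(blob)
    labels = [label_row(r) for r in rows]
    return receipts(blob, rows, labels, aggregate(labels))

# Illustrative input only.
blob = """
t1|Please cancel my plan
t2|Love the new dashboard
t3|I will cancel unless pricing changes
"""
report = run(blob)
print(report["counts"])
print(run(blob) == report)  # rerunning on the same input gives the same report
```

With a real model in step 2, the rerun check only holds if you store the labels alongside the input hash. The counting itself stays deterministic either way.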

A simple example (that scales)

Let’s say you have 5,000 support tickets and you ask:

“How many are billing vs bugs vs feature requests?”

Bad approach:

paste all tickets into the model
ask for totals
trust the numbers

Disciplined approach:

store tickets as rows
ask the LLM to label each ticket (billing/bug/feature) with a confidence tag
use the CPU to count labels exactly
output: totals + 20 example tickets per category + rerun command

Now your result isn’t “a vibe.”

It’s a report you can audit.
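
Here is one way the disciplined approach might look in code. Treat it as a sketch: label_ticket is a hypothetical keyword stand-in for a per-ticket LLM call, and sending low-confidence labels to a “needs_review” bucket is my assumption about what to do with the confidence tag, not something the approach requires. The rerun command is left out because it depends on your setup.

```python
from collections import Counter, defaultdict

# Hypothetical keyword hints standing in for the model's judgment.
KEYWORDS = (("billing", "invoice"), ("bug", "crash"), ("feature", "please add"))

def label_ticket(text: str) -> tuple[str, str]:
    """Stand-in for an LLM call on ONE ticket.
    Returns (label, confidence), confidence being 'high' or 'low'."""
    lowered = text.lower()
    hits = [label for label, keyword in KEYWORDS if keyword in lowered]
    if len(hits) == 1:
        return hits[0], "high"
    return (hits[0] if hits else "unlabeled"), "low"

def summarize(tickets, examples_per_category=20):
    """Count labels exactly and keep up to N example ticket ids per bucket."""
    counts = Counter()
    examples = defaultdict(list)
    for ticket_id, text in tickets:
        label, confidence = label_ticket(text)
        # Assumption: low-confidence labels are set aside rather than counted.
        bucket = label if confidence == "high" else "needs_review"
        counts[bucket] += 1
        if len(examples[bucket]) < examples_per_category:
            examples[bucket].append(ticket_id)
    # Every ticket lands in exactly one bucket, so totals must reconcile.
    assert sum(counts.values()) == len(tickets)
    return counts, dict(examples)

# Illustrative tickets only.
tickets = [
    (1, "My invoice shows a double charge"),
    (2, "App crash on login"),
    (3, "Please add dark mode"),
    (4, "Crash when I open my invoice"),
]
counts, examples = summarize(tickets)
print(dict(counts))
print(examples)
```

The model only ever sees one ticket at a time. The totals come from a loop, and the reconciliation assert fails loudly if a ticket goes missing.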

Why attention fails at counting (plain-English)

LLMs don’t “loop” over tokens the way your code loops over rows.

They compress patterns, and compression is approximate.

Counting is not pattern-completion.

Counting is enumeration.

So the fix is not “think harder.”

The fix is: use the right tool.

The MrBeast-style challenge (viral loop)

Let’s do a public sparring test.

Challenge: paste a long chunk of text (or a dataset summary) and ask for an exact count or aggregation.

I’ll reply with:

the minimal parse schema
whether you need LLM classification (or not)
the exact Counter() aggregation strategy
what receipts you should log so a skeptic can rerun it

Comment COUNT and I’ll drop the “Counter-Bypass Card” you can paste into your agent prompt.

Bonus: the Tower framing (makes it memorable)

This is Floor 2: Foundation in the Stillwater Tower.

If your agent can’t beat the Counting Dragon, it shouldn’t be touching:

analytics
finance
ops dashboards
compliance summaries
incident postmortems

Because those domains punish “almost.”

The point (one line)

Long context isn’t memory. It’s a bigger whiteboard. If you want correctness: determinism + receipts.

Endure. Excel. Evolve. — Phuc Vinh Truong