I Wrote 16 Pages of Specs. The AI Still Got Down-and-Distance Wrong.
What a broken NFL simulation engine taught me about domain expertise, data transformation, and why architectural simplicity is something you enforce, not something agents choose.
By Phil Hall
A thousand simulated NFL games. Neither team committed a single penalty in any of them. Drives started on second down. Teams threw incomplete passes on fourth and ten. I had handed an AI hundreds of columns of play-by-play data and a confident prompt. It gave me something that looked like football and played nothing like it.
What most people get wrong
The demos are convincing. Tools like Lovable and Replit will produce a working app from a typed prompt. When large models demonstrate one-shot software, the message is implicit: domain knowledge is now optional. Just describe what you want. Let the model figure it out.
It works for UI. It fails for logic. I built a functional-looking simulation UI in one prompt. Team selectors, spread input, win probability output, a reasonably nice interface. If I had stopped there, I might have believed the whole thing.
What we've actually seen
I tried to build a full NFL game simulation engine using Claude, starting from a GitHub repository with play-by-play data on every game in the 2025 season. The raw dataset had hundreds of columns. The prompt was simple: here is the data, you know football, build a play-by-play simulator.
The first version looked convincing at the individual drive level. First and ten. Incomplete pass. Nine-yard run. Third and one. Punt. The sequence tracked. Then I ran a thousand simulations and looked at the aggregate.
- Avg plays per team per sim: 56–63
- Avg simulated final score: 19–16
- Total penalties in 1,000 games: 0
Down-and-distance logic was broken at the fundamental level. I spent ten iterations prompting fixes. Nothing held. The underlying architecture was wrong in ways that couldn't be patched from outside.
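For reference, the rule the simulator kept getting wrong is small. The sketch below is one minimal way to state it in Python, not the engine's actual code. `DownState` and `advance` are hypothetical names, the 75-yard starting position is an assumption, and penalties, turnovers, kicks, and safeties are deliberately left out.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class DownState:
    down: int           # 1 through 4
    yards_to_go: int    # yards needed for a first down
    yards_to_goal: int  # distance to the opponent's end zone


def advance(state: DownState, yards_gained: int) -> Union[DownState, str]:
    """Apply one scrimmage play; return the next state or a drive-ending result."""
    new_to_goal = state.yards_to_goal - yards_gained
    if new_to_goal <= 0:
        return "touchdown"
    if yards_gained >= state.yards_to_go:
        # First down: reset to 10 yards, or goal-to-go when closer than 10.
        return DownState(1, min(10, new_to_goal), new_to_goal)
    if state.down == 4:
        return "turnover_on_downs"
    # Otherwise the down advances and the distance shrinks (or grows on a loss).
    return DownState(state.down + 1, state.yards_to_go - yards_gained, new_to_goal)


# Replay the sequence from the first version: 1st and 10, incomplete, 9-yard run.
state = DownState(down=1, yards_to_go=10, yards_to_goal=75)
state = advance(state, 0)  # incomplete pass
state = advance(state, 9)  # nine-yard run
print(state)
```

Replaying that sequence lands on third and one. A correct engine has to reproduce this on every drive, not just the ones that happen to get inspected.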
Why it works that way
The problem was not the model. The problem was the input.
I handed it hundreds of columns and expected it to know which ones mattered, what the relationships between them meant, and how a football game actually flows. It didn't. It invented formulas. It hardcoded fallback values for red zone rates and punting. It treated every game situation as roughly equivalent. The difference between a two-minute drill and a clock-grinding fourth quarter disappeared entirely.
The data needed to be transformed before the model touched it. Not cleaned. Transformed.
I built a gold layer that compressed those hundreds of columns into a smaller, purpose-built set. Each raw field got a deliberate interpretation:
| Raw field | Transformed to |
|---|---|
| Score differential (integer) | Bucket: winning by 2+ scores / 1 score / tied / losing |
| Yards to go (integer) | Short / medium / long / very long |
| Time remaining in half | Boolean: almost end of half |
| Down number | Early down (1st–2nd) / late down (3rd–4th) |
| Distance to end zone | Red zone / fringe field goal / open field |
| Play duration per team | Pace classification: fast / neutral / slow |
These aren't arbitrary categories. They reflect how football works. A team down by 30 in the fourth quarter behaves nothing like the same team tied at halftime. The model had no way to know that from column names and raw values.
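As a rough illustration of what that gold layer does, here is a minimal Python sketch of the bucketing. The categories come from the table above; every numeric threshold and every field name (`score_diff`, `seconds_left_in_half`, and so on) is an assumption made for the example, not a value from the real pipeline. Pace classification is omitted because it is computed per team rather than per play.

```python
def score_bucket(score_diff: int) -> str:
    # Assumption: one score is at most 8 points (touchdown plus two-point try).
    if score_diff >= 9:
        return "winning_2plus_scores"
    if score_diff >= 1:
        return "winning_1_score"
    if score_diff == 0:
        return "tied"
    return "losing"


def distance_bucket(yards_to_go: int) -> str:
    # Assumed cut points: 1-2 short, 3-6 medium, 7-10 long, 11+ very long.
    if yards_to_go <= 2:
        return "short"
    if yards_to_go <= 6:
        return "medium"
    if yards_to_go <= 10:
        return "long"
    return "very_long"


def down_bucket(down: int) -> str:
    return "early" if down <= 2 else "late"


def field_zone(yards_to_goal: int) -> str:
    # Assumed cut points: inside the 20 is red zone, 21-35 is fringe range.
    if yards_to_goal <= 20:
        return "red_zone"
    if yards_to_goal <= 35:
        return "fringe_field_goal"
    return "open_field"


def almost_end_of_half(seconds_left_in_half: int) -> bool:
    # Assumption: the final two minutes count as "almost end of half".
    return seconds_left_in_half <= 120


def situation_key(play: dict) -> tuple:
    """Collapse one raw play row into the compact situation the model sees."""
    return (
        score_bucket(play["score_diff"]),
        distance_bucket(play["yards_to_go"]),
        almost_end_of_half(play["seconds_left_in_half"]),
        down_bucket(play["down"]),
        field_zone(play["yards_to_goal"]),
    )
```

The point of the design is that the thresholds live in one place and carry a football reason, so the model receives `("tied", "long", False, "early", "open_field")` instead of five unlabeled integers.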
That transformation took hours. It required actually understanding the game.
After the transformation, I wrote a 16-page specification covering simulation architecture, a full data dictionary, fallback logic for thin sample sizes, formulas for blending team-specific rates with league averages, and rules for every game situation from kickoffs to safety scores.
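The blending piece is the easiest part of that spec to show in code. The sketch below uses a common shrinkage approach, where the league average acts like a fixed number of extra observations. It is an illustration, not the formula from the spec, and `prior_weight` and its default value are assumptions.

```python
def blended_rate(
    team_successes: float,
    team_attempts: float,
    league_rate: float,
    prior_weight: float = 50.0,
) -> float:
    """Blend a team's observed rate with the league average.

    The league rate counts as `prior_weight` pseudo-attempts, so thin samples
    stay close to the league average and large samples approach the team's
    own rate. With zero attempts the result is exactly the league rate.
    """
    if prior_weight <= 0:
        raise ValueError("prior_weight must be positive")
    if team_attempts < 0 or team_successes < 0:
        raise ValueError("counts must be non-negative")
    return (team_successes + prior_weight * league_rate) / (team_attempts + prior_weight)
```

A team that converted 5 of 10 attempts against a 40% league rate, with `prior_weight=10`, blends to 45%: halfway between its own 50% and the league figure, because the sample and the prior carry equal weight.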
Then, before any development, I gave the spec to Claude and asked it to review the document as a software analyst with football expertise. It found real gaps. Missing safety rules. Incorrect fumble recovery rates. Ambiguous variable naming. I fixed those issues in the spec first.
Even with all of that, the first coded version still broke. Down-and-distance was wrong. Kickoffs were missing. Field goals never happened.
Those errors were not documentation problems. They were environmental. In a single-model context with no feedback loop, a bad assumption made early propagates through every downstream calculation before you ever see the output.
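A cheap feedback loop is an aggregate audit over a batch of simulated games, so that a bad early assumption surfaces as a failed check rather than as a plausible-looking drive. A minimal sketch, assuming each game is a dict with a `plays` list and each play carries hypothetical `play_type`, `down`, and `drive_start` fields:

```python
from collections import Counter


def audit_games(games: list) -> list:
    """Return human-readable problems found across a batch of simulated games."""
    problems = []
    plays = [play for game in games for play in game["plays"]]
    counts = Counter(play["play_type"] for play in plays)

    # Play types that must appear at least once in any realistic batch.
    for required in ("penalty", "kickoff", "field_goal"):
        if counts[required] == 0:
            problems.append(f"no {required} plays in {len(games)} games")

    # Every drive has to begin on first down.
    bad_starts = sum(
        1 for play in plays if play.get("drive_start") and play.get("down") != 1
    )
    if bad_starts:
        problems.append(f"{bad_starts} drives started on a down other than first")

    return problems
```

The checks mirror the failures described above: zero penalties, missing kickoffs, no field goals, and drives that open on second down. None of them require a threshold, only a count.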
How to act on this
The next thing I tried was a multi-agent workflow. Seven specialized agents handled discrete phases: clarifying questions, spec writing, codebase research, implementation planning, QA review, development, and code review. Three different frontier models. Human approval at key gates.
The workflow ran. Features got built. Then the same pattern: statistics that worked for some teams and not others. Offensive penalty rates correct. Defensive rates wrong.
I asked the system to diagram what it had actually built versus what it should have been.
Before
- Pass offense stats → pass offense pipeline
- Rush offense stats → rush offense pipeline
- Pass defense stats → pass defense pipeline
- Rush defense stats → rush defense pipeline
- Tendencies stats → tendencies pipeline
- Each new category → another new pipeline
After
- CSV data
- Stat registry (all categories defined in one place)
- Config file (fallback logic mapped per stat)
- Single transform applied to every team, every interface
Six agents across ten iterations produced sixty rounds of changes. Each round solved the immediate problem and added a new pipeline underneath. Uncoupling what they built took the same number of iterations as building it correctly would have. The multi-agent workflow didn't solve the complexity problem. It made complexity faster.
What actually worked was three things together.
Know the simplest correct system before you start. Every statistic in this codebase follows the same pattern: read from CSV, look up the right bucket, apply the team's rate. That maps to three components:
- One registry that defines all statistics and their fallback classifications
- One config file that maps each statistic to its game scenario buckets
- One consistent transform applied to every team and every interface
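Those three components fit in a few dozen lines. The sketch below is a simplified illustration, not the project's code: the statistic names, the fallback values, the CSV column layout (`team,stat,bucket,rate`), and the names `StatDef`, `REGISTRY`, `BUCKET_FIELDS`, `load_rates`, and `lookup_rate` are all assumptions chosen for the example.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class StatDef:
    name: str
    fallback: float  # placeholder league-wide rate used when a team row is missing


# Registry: every statistic is declared once (values are illustrative only).
REGISTRY = {
    s.name: s
    for s in (
        StatDef("pass_rate", fallback=0.5),
        StatDef("offensive_penalty_rate", fallback=0.05),
        StatDef("defensive_penalty_rate", fallback=0.05),
    )
}

# Config: which situation fields form the bucket key for each statistic.
BUCKET_FIELDS = {
    "pass_rate": ("down_bucket", "distance_bucket"),
    "offensive_penalty_rate": ("down_bucket",),
    "defensive_penalty_rate": ("down_bucket",),
}


def load_rates(csv_text: str) -> dict:
    """Read rows of team,stat,bucket,rate into a lookup table."""
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        table[(row["team"], row["stat"], row["bucket"])] = float(row["rate"])
    return table


def lookup_rate(table: dict, team: str, stat: str, situation: dict) -> float:
    """The single transform: one code path for every team and every statistic."""
    stat_def = REGISTRY[stat]  # an unregistered statistic raises KeyError at once
    bucket = "|".join(situation[field] for field in BUCKET_FIELDS[stat])
    return table.get((team, stat, bucket), stat_def.fallback)
```

Adding a new statistic means adding one registry entry and one config entry. There is no new pipeline to write, which is the property the agents kept eroding.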
Enforce it with deterministic tools. Linters, scripts, and test harnesses that fail loudly when the pattern breaks. Agents drift toward new pipelines because nothing stops them. A linter that flags the violation makes it visible immediately instead of three iterations later.
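Such a check does not need to be sophisticated. As one hedged example, the script below uses Python's standard `ast` module to flag any function whose name ends in `_pipeline`. That naming rule is a hypothetical convention picked for the sketch; a real project would encode whatever signal marks a new per-category pipeline in its own codebase.

```python
import ast


def find_extra_pipelines(source: str) -> list:
    """Return 'line N: name' for every function that looks like a new pipeline."""
    tree = ast.parse(source)
    hits = [
        (node.lineno, node.name)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name.endswith("_pipeline")  # hypothetical naming convention
    ]
    # ast.walk does not guarantee source order, so sort by line number.
    return [f"line {lineno}: {name}" for lineno, name in sorted(hits)]


def check_source(source: str) -> None:
    """Fail loudly, suitable for a pre-commit hook or CI step."""
    violations = find_extra_pipelines(source)
    if violations:
        raise SystemExit("per-category pipelines found:\n" + "\n".join(violations))
```

Run against every changed file, a check like this turns architectural drift into a failed build on the iteration where it happens.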
Guide explicitly toward the target. Not by approving specs and hoping. By knowing the target architecture, stating it in the prompt, and reviewing the actual code to verify it landed. Not just the plan.
Autonomous agents execute well. They do not constrain themselves. They will build whatever the task permits, and what the task permits is usually more complex than what you need. The job is deciding the simplest correct solution before they start and enforcing it with tools.
Human judgment does not belong only at the spec checkpoint. It belongs in the architecture. It belongs in the toolchain. And it belongs in the code review. Not to catch bugs. To verify that what got built matches what was designed.
AI got this done faster than I could have alone. But only once the architecture was simple enough that sixty rounds of agent changes could not break it.