Ground Truth.
A two-person team resolved in three weeks what AI couldn't crack in three months. The method: go to the store.
They had been trying to get the discount numbers to reconcile for three months. It took one day to understand why they couldn't.
Months stalled: 3
Days to identify: 1
Weeks to resolve: 3
THE PROBLEM
A large retailer had just migrated to a new point-of-sale system. The old system aggregated by default. Six cartons of milk at the register came back as "milk, quantity 6." A coupon came back as a single line with a total value. Clean. Simple. But it offered no visibility into what was actually happening at the register.
The new system respected the full sequence of events. Each item scanned individually. Each adjustment nested under the specific item it applied to. Every transaction was now a precise, timestamped record of exactly what a customer did. This was the design intent, and it was correct.
The problem was that their analytical workflows had been built on top of the old system's aggregated outputs. The old system handed them answers: total discount, coupon count, adjustment impact. The new system handed them the inputs required to calculate those answers. Nobody had built that calculation layer yet.
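A minimal sketch of what such a calculation layer might look like, assuming each scanned item carries its own nested adjustments. The class names, field names, and the "manufacturer_coupon" label are hypothetical and are not taken from the retailer's schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class Adjustment:
    kind: str        # hypothetical label, e.g. "manufacturer_coupon"
    amount: Decimal  # positive amount taken off the item price


@dataclass
class ScannedItem:
    sku: str
    price: Decimal
    adjustments: list[Adjustment] = field(default_factory=list)


def summarize(items):
    """Roll item-level events up into the totals the old system returned."""
    quantity_by_sku = defaultdict(int)
    discount_by_kind = defaultdict(Decimal)  # Decimal() starts at 0
    for item in items:
        quantity_by_sku[item.sku] += 1
        for adj in item.adjustments:
            discount_by_kind[adj.kind] += adj.amount
    return {
        "quantity_by_sku": dict(quantity_by_sku),
        "discount_by_kind": dict(discount_by_kind),
        "total_discount": sum(discount_by_kind.values(), Decimal("0")),
    }


# Six cartons of milk scanned one by one, with a coupon on the first carton.
basket = [ScannedItem("milk", Decimal("3.99")) for _ in range(6)]
basket[0].adjustments.append(Adjustment("manufacturer_coupon", Decimal("0.50")))
summary = summarize(basket)
```

Here `summary["quantity_by_sku"]` recovers the old "milk, quantity 6" view, while the item-level detail remains available underneath it.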
This wasn't an abstract gap. The retailer was behind schedule and under pressure. Their marketing models ran on promotional data. At roughly $9 billion in monthly revenue, with discount and promotional activity running 10-15% across the business, getting those numbers right wasn't optional. They were flying blind on close to $1 billion per month in promotional impact.
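As a rough check on that figure, the arithmetic can be worked through directly. The inputs are the rounded numbers quoted above, not exact financials.

```python
# Rounded figures from the text; integer math avoids floating-point noise.
monthly_revenue = 9_000_000_000  # roughly $9 billion per month
low_pct, high_pct = 10, 15       # promotional activity as a percent of revenue

low = monthly_revenue * low_pct // 100
high = monthly_revenue * high_pct // 100

print(f"${low / 1e9:.2f}B to ${high / 1e9:.2f}B per month")
```

The range works out to about $0.9 billion to $1.35 billion per month.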
They tried using LLMs to accelerate the data profiling work. The numbers still wouldn't reconcile. That's when we were brought in.
WHAT WE FOUND
The first thing we did was open the data and look at it.
The confusion was immediate. There was coupon_flg, a boolean column that appeared to signal whether a coupon had been used in a transaction. There was also a discounts table and a separate adjustments table. Was a coupon a discount or an adjustment? Could you just use the flag?
Within the first 24 hours, we pulled thousands of transactions where coupon_flg was set to true and there were objectively no coupons involved. We pulled thousands more where the flag was false and coupons were clearly present. The flag was unreliable.
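A check of this kind can be sketched as follows, assuming there is some independent way to confirm which transactions really involved a coupon. The `txn_id` field and the `confirmed_coupon_ids` set are hypothetical stand-ins for that evidence. Only `coupon_flg` comes from the data described here.

```python
def audit_flag(transactions, confirmed_coupon_ids):
    """Split transactions where coupon_flg disagrees with confirmed evidence."""
    flagged_without_coupon = []  # flag is true, no coupon actually present
    coupon_without_flag = []     # flag is false, coupon actually present
    for txn in transactions:
        has_coupon = txn["txn_id"] in confirmed_coupon_ids
        if txn["coupon_flg"] and not has_coupon:
            flagged_without_coupon.append(txn["txn_id"])
        elif not txn["coupon_flg"] and has_coupon:
            coupon_without_flag.append(txn["txn_id"])
    return flagged_without_coupon, coupon_without_flag


# Illustrative rows only, not real data.
sample = [
    {"txn_id": "T1", "coupon_flg": True},
    {"txn_id": "T2", "coupon_flg": False},
    {"txn_id": "T3", "coupon_flg": True},
]
false_pos, false_neg = audit_flag(sample, confirmed_coupon_ids={"T2", "T3"})
```

Both lists being non-empty is the pattern described above: the flag is wrong in both directions, so it cannot be corrected by simply inverting it.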
This wasn't a data quality failure in the traditional sense. The POS system was behaving as configured. POS systems ship with default classification logic that works for most of their customers. For this retailer, the mapping between those defaults and their specific business rules didn't hold. Nobody had documented the gap.
That distinction matters. The system wasn't broken. The context was missing.
A model goes in, sees coupon_flg: true or false, and draws the obvious inference. Without additional context, without evidence to the contrary, it uses what's there. That's not a model failure. That's a model doing exactly what any analyst without specific domain knowledge would also do. The problem was never the AI. The problem was missing context. And there was only one way to build it.
WHAT WE BUILT
We went to the store.
Not metaphorically. We went around the office and collected every form of discount and coupon we could find. Loyalty app codes. The retailer's branded debit card, which offered a flat 5% discount on any transaction. Manufacturer coupons pulled from previous receipts. Physical coupons from aisle dispensers. Social media promo codes. Gift cards. We gathered a physical representation of every discount type that could reasonably show up in the data.
1. Collect all discount types: loyalty app, debit card, manufacturer coupons, aisle coupons, social codes, gift cards.
2. Run 15 test transactions: each designed to stress-test a different combination of discount types.
3. Color-code receipts by type: different colors for different discount categories.
4. Trace to warehouse: match each receipt line to warehouse rows via store ID, register number, and timestamp.
5. Build the catalog: one version for human search, one optimized for agentic query.
Then we went shopping. Fifteen transactions, each designed to stress-test a different combination. What did the data look like when a loyalty purchase was paired with a gift card? What happened when item-level coupons stacked with a transaction-wide debit card discount? What did a pure price adjustment look like versus a manufacturer coupon?
We came back with receipts. We color-coded them by discount type. Then we traced each transaction back into the warehouse using the store ID, register number, and receipt timestamp. We mapped every highlighted line on the physical receipt to the corresponding row in the data.
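The tracing step can be sketched as a lookup on those three keys. The field names are hypothetical, and the 60-second tolerance is an assumption added for illustration, on the premise that a printed receipt time and a warehouse timestamp may not match to the second.

```python
from datetime import datetime, timedelta


def match_receipt(receipt, warehouse_rows, tolerance=timedelta(seconds=60)):
    """Return warehouse rows that plausibly belong to one physical receipt."""
    return [
        row
        for row in warehouse_rows
        if row["store_id"] == receipt["store_id"]
        and row["register_no"] == receipt["register_no"]
        and abs(row["ts"] - receipt["ts"]) <= tolerance
    ]


# Illustrative values only.
receipt = {"store_id": 101, "register_no": 4, "ts": datetime(2024, 1, 15, 14, 30)}
warehouse_rows = [
    {"store_id": 101, "register_no": 4, "ts": datetime(2024, 1, 15, 14, 30, 12), "line": 1},
    {"store_id": 101, "register_no": 4, "ts": datetime(2024, 1, 15, 14, 30, 20), "line": 2},
    {"store_id": 101, "register_no": 5, "ts": datetime(2024, 1, 15, 14, 30, 15), "line": 1},
    {"store_id": 101, "register_no": 4, "ts": datetime(2024, 1, 15, 16, 5), "line": 1},
]
matched = match_receipt(receipt, warehouse_rows)
```

Only the first two rows match: the third came from a different register and the fourth from a later transaction on the same register.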
The result was a complete picture: every discount type, every relevant column, every edge case, grounded in real transactions with physical receipts as the source of truth. We built two versions of the resulting catalog. One structured for human reference. One optimized so agents could query transactional data accurately without repeating the groundwork.
Three months of stalled reconciliation. One day to understand the problem. Three weeks to resolve it.
What was true of the data turned out to be true of the team.
The retailer ran four analytics and BI pods in a US/India structure. Each pod had a US-based product owner and business analyst, and an India-based scrum master, three developers, and two testers. They had been running this model for about a year. Story point slip was running 40 to 50% per sprint. They were delivering roughly 60% of committed work each cycle.
The instinct here is to question the estimates. Maybe the stories were sized too large. Maybe points weren't scored accurately. We examined that. The sizing was reasonable. The work was scoped appropriately.
The gap was something else. The Indian teams were technically strong. But they were building analytics for a business they had never seen or experienced. The retailer didn't operate in India. A US analyst treats certain concepts as given. What a manufacturer coupon looks like. How a loyalty app works at the register. What the debit card discount feels like as a transaction. None of these had an equivalent in the Indian team's professional or personal experience.
In agentic terms, these concepts weren't built into their training data. And like any model operating outside its training distribution, they made assumptions. Assumptions compounded. Stories slipped. The same categories of issues recurred sprint after sprint.
We flew to India. Six people total: two from Motif, and the four US product owners. We brought the same color-coded receipts from the data profiling work.
We set up a fake store in the Indian office. We walked through what it looks like to move through the aisles, approach a register, hand over a physical coupon, tap a loyalty card, and use the debit card for the transaction-wide discount. We made it tangible.
Then we ran root cause analysis workshops. Each developer took turns describing, on paper, what information they needed and how they'd ask for it. The rest of the team acted as the computer and returned a result. We ran this across the most common recurring failure categories: data fan-out, duplication, non-unique identifiers. The goal wasn't to explain symptoms. It was to build the habit of asking better questions before reaching for a fix.
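Data fan-out, one of the failure categories named above, is easy to reproduce in a few lines. This is a generic illustration with hypothetical table and column names, not a reconstruction of the retailer's data: joining on a key that is not unique on one side silently duplicates rows and inflates totals.

```python
from collections import defaultdict


def inner_join(left, right, key):
    """Naive inner join: one output row per matching (left, right) pair."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# One item sold once, but two promotion rows share the same item_id.
items = [{"item_id": "A1", "price": 10}]
promos = [
    {"item_id": "A1", "promo": "loyalty"},
    {"item_id": "A1", "promo": "coupon"},
]

joined = inner_join(items, promos, "item_id")
true_sales = sum(row["price"] for row in items)
inflated_sales = sum(row["price"] for row in joined)
```

The item appears twice after the join, so summing `price` over the joined rows doubles the sale. Asking "is this key unique on both sides?" before writing the join is the kind of question the workshops were meant to make habitual.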
By the end of the week, they had three things they didn't have before: a concrete mental model of what a customer transaction actually looks like, color-coded mappings between physical receipts and data they could reference directly, and a shared vocabulary with their US counterparts that didn't require translation on every story.
OUTCOMES
The marketing model shipped.
Sprint efficiency moved from 60% to 90% and held for the remaining six months of the engagement.
Sprint efficiency: 60% → 90%
Story point slip: ~10%
Held for: 6 months
Trip cost: $30K
The total cost of the India trip was roughly $30,000. Six people, economy class, one week, hotels included.
WHAT THIS MEANS FOR OTHERS
The same problem showed up twice in this engagement. A team trying to reason about something it had never directly experienced, making the best assumptions it could from abstraction alone. One team was a large language model working through an ambiguous schema. The other was a group of skilled developers on the other side of the world.
The answer was the same both times. Make the abstract concrete. Put something real in front of them. Build artifacts that capture it and last.
There's an instinct right now to solve every problem with more tooling, better models, and additional automation. Sometimes that's correct. But some problems exist because of a gap between what's real and what's been represented, whether in a data warehouse or in someone's mental model. Automation doesn't close that gap. Presence does.
Some problems still require physical presence. This was one of them.