When One Developer Outworked a Team of Five.
A 12-month data platform build was three months from failure. One developer and a Spark SQL pivot delivered go-live within 1% of historical Acxiom benchmarks.
Nine months into a twelve-month project, the batch jobs powering a major fashion retailer's new customer data platform were taking 26 hours to complete. The jobs needed to finish in 24. The team of five Informatica developers responsible for building them had no clear path to fix it. Go-live was three months away.
That's when the client made a call most executives in that position wouldn't have the courage to make: scrap the ETL layer entirely and rebuild it from scratch.
Two people from our team had been embedded on this engagement since the start. One senior Spark developer had joined three months earlier to solve a single edge case. That team of three rebuilt everything in time. Go-live landed on schedule, with customer data reconciled to within 1% of the external benchmarks they were replacing.
- Customer Data Sources: 22
- Engagement Timeline: 12 months
- In-House Developer Team: 5 FTEs
THE PROBLEM
The client is a major fashion manufacturer and retailer operating globally. For years, they had relied on Acxiom to manage their customer data. Audience creation, customer health scoring, marketing segmentation: all of it was owned and operated externally. The client didn't own the data logic. They contracted it.
That arrangement works until it doesn't. The decision to in-house was a strategic one. The plan was to build a new Hadoop data lake to consolidate and structure customer data before it reached the marketing mart. They had an Informatica license and in-house developers who knew the tool. The scope was to ingest all 22 customer data source systems, replicate the logic Acxiom had been handling, and model the output into a format that could power IBM Campaign running on a Netezza custom mart.
The scope was significant but not unreasonable. Tens of millions of customer records. Twenty-two sources. One cohesive output.
WHAT WE FOUND
Our team was brought in for staff augmentation: two people embedded to help with business process documentation and data mapping. We sat in on meetings with business stakeholders. We translated how they had historically provided data to Acxiom into mapping documents the development team could use. Twenty-two sources into one consistent file. The mapping work itself was not complicated.
Testing was where the project showed its true shape.
Most of the initial test cases were straightforward. A handful of defects, nothing alarming. But the customer email table was a different problem. Email addresses were present in roughly 80% of the 22 source systems. Each of those systems operated on a different schedule. Some refreshed every 15 minutes. Others ran daily or weekly. That alone introduces sequencing complexity. But email data at a globally operating company also carries GDPR obligations: the right to be forgotten, opt-in and opt-out compliance, and a requirement to evaluate the effective date of an action, not the date a record arrived in the system.
If a customer opted out on April 10th but the record reached the system on April 15th, the platform had to honor April 10th. That means reading event history across 22 sources, resolving conflicts between them, and always surfacing the most current compliant status.
1. Record arrives: the system receives customer data from one of 22 sources on its own cadence.
2. Read effective date: evaluate when the action occurred, not when the record was received.
3. Compare source histories: resolve conflicts across all 22 source records.
4. Apply compliance rules: honor GDPR opt-in/opt-out status as of the effective date.
5. Surface current status: write the most recent compliant email status to the customer record.
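The effective-date rule is easier to see in a small example. The following is a minimal Python sketch, not the Spark SQL that ran on the project: the `ConsentEvent` fields, the source names, the year, and the tie-break on arrival date are all assumptions made for illustration. It reproduces the April 10th case: the opt-out arrives before a late-delivered opt-in, yet still wins because it took effect later.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ConsentEvent:
    email: str
    source: str            # hypothetical source-system name
    status: str            # "opt_in" or "opt_out"
    effective_date: date   # when the customer took the action
    arrival_date: date     # when the record reached the platform


def current_status(events):
    """Return {email: status}, choosing the latest effective date per email."""
    latest = {}
    for ev in events:
        best = latest.get(ev.email)
        # Rank by effective date; break ties by arrival date (an assumption).
        key = (ev.effective_date, ev.arrival_date)
        if best is None or key > (best.effective_date, best.arrival_date):
            latest[ev.email] = ev
    return {email: ev.status for email, ev in latest.items()}


# Illustrative events only; the year is arbitrary.
events = [
    # Opt-out took effect April 10 but reached the platform April 15.
    ConsentEvent("a@example.com", "pos", "opt_out", date(2024, 4, 10), date(2024, 4, 15)),
    # Older opt-in from a weekly source: effective April 5, arrived April 16.
    ConsentEvent("a@example.com", "web", "opt_in", date(2024, 4, 5), date(2024, 4, 16)),
]

print(current_status(events))
```

Ranking by arrival date would have surfaced the opt-in, because it landed last. Ranking by effective date surfaces the opt-out, which is the status the platform had to honor.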
That is not a mapping exercise. It requires recursive logic, sequencing awareness, and business rule clarity that sits outside any GUI-based ETL tool's comfort zone.
The in-house Informatica team said the customer email table was ready for testing. Ninety-eight percent of the test cases failed.
We had a direct conversation with the client stakeholder. If this pattern continued, go-live was at serious risk. The test scripts we had written in Spark SQL to validate ETL output were better positioned to do this job than the Informatica jobs themselves.
The client made a pragmatic call. Use Informatica for everything else. Make an exception for this table. The team grew from two to three. We brought on a senior Hadoop-native developer. The client executive took on testing personally, keeping a clean separation of duties.
But the email table wasn't an isolated problem. The same failure patterns surfaced across other Netezza customer objects. Then the performance issue landed: batch jobs that needed to complete in 24 hours were running for 26.
WHAT WE BUILT
We diagnosed the performance problem. The answer was in how the in-house team had built their Informatica jobs, and where a critical skill gap had formed.
Informatica has a push-down mode. In push-down mode, it delegates compute work to the underlying data platform rather than processing on its own servers. But the in-house team was using transformation functions that had no logical equivalent in Hadoop, so Informatica couldn't push the logic down. Every job followed the same pattern: pull data into the Informatica server, transform it using constrained compute resources, then write it back to Hadoop. A query that would take 30 to 45 seconds running natively on Hadoop was instead being routed through that bottleneck.
Before
1. Pull data into Informatica
2. Process on constrained compute
3. Write back to Hadoop as CSV
After
1. Run transformation natively on Hadoop
2. Write optimized Parquet/ORC format
3. No data movement overhead
There was a second issue. The team had configured Informatica to write CSV files rather than Parquet or ORC. The choice was convenient for debugging: when test cases failed, developers could open a CSV in Excel and inspect the data directly. But it made every query slower, because the entire cluster was reading uncompressed, row-oriented flat files instead of compressed columnar ones.
These weren't failures of intent. They were the predictable result of applying an ETL-era tool to a Big Data environment without the bridge skills to connect the two. Knowing how to use Informatica is not the same as knowing how Informatica should behave on a Hadoop-native cluster.
Nine months in, the client reached a conclusion: the system they had built was brittle. The tool choices, the configuration decisions, and the skill constraints had compounded. Rebuilding on a different foundation was less risky than trying to stabilize what existed.
The team of three made the case to take it on. We had written every mapping document. We had written every test case. We knew the business rules behind every transformation. And we had the developer who had already proven the Spark SQL approach with the most complex business rules and transformation logic in the schema.
That developer rebuilt the entire ETL layer in Spark SQL using native Hadoop orchestration in the three months remaining before go-live.
The rebuild wasn't without issues. Our developer introduced bugs. But because roughly 70% of the original Informatica work had already passed UAT, we had a system of record. We ran like-for-like comparisons between the Spark SQL outputs and the approved Informatica outputs to accelerate validation. When numbers diverged, we investigated. Sometimes our developer was wrong. Other times, the investigation surfaced real data quality problems the original testing had missed entirely: status flags set incorrectly, opt-in and opt-out sequences misread, and gaps in how historical transactions were tied back to customers identified across systems. Reconstructing that history required recursive restatement across sources. That kind of issue only becomes visible when two independent implementations try to reach the same answer and disagree.
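A like-for-like comparison of this kind can be sketched in a few lines. The Python below is an illustration under assumptions, not the project's validation code: the function name `compare_outputs`, the customer ids, and the `email_status` field are hypothetical. It treats the approved output as the baseline and sorts every disagreement into one of three buckets for investigation.

```python
def compare_outputs(baseline, candidate):
    """Diff two {key: row} mappings produced by independent implementations."""
    baseline_keys, candidate_keys = set(baseline), set(candidate)
    return {
        # Present in the approved output but absent from the rebuild.
        "missing_in_candidate": sorted(baseline_keys - candidate_keys),
        # Present only in the rebuild.
        "extra_in_candidate": sorted(candidate_keys - baseline_keys),
        # Present in both, but the rows differ.
        "mismatched": sorted(
            k for k in baseline_keys & candidate_keys if baseline[k] != candidate[k]
        ),
    }


# Hypothetical rows keyed by customer id.
informatica_rows = {
    101: {"email_status": "opt_in"},
    102: {"email_status": "opt_out"},
    103: {"email_status": "opt_in"},
}
spark_rows = {
    101: {"email_status": "opt_in"},
    102: {"email_status": "opt_in"},   # disagreement to investigate
    104: {"email_status": "opt_out"},  # appears only in the new output
}

report = compare_outputs(informatica_rows, spark_rows)
print(report)
```

The diff says nothing about which side is right. As on this project, a mismatch can mean a bug in the rebuild or a defect the baseline carried through testing.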
- ATN Team Size at Go-Live: 3
- Full Rebuild Window: 3 months
- Reconciliation vs. Acxiom: < 1%
OUTCOMES
Go-live landed on schedule.
When the new platform was reconciled against the historical benchmarks from Acxiom, customer counts and key data quality metrics came in within 1% of the external numbers. For an in-housing initiative of this scale, with 22 source systems and tens of millions of customer records, that is a significant result. One developer replaced a team of five. The three-person team that started with mapping documents finished with a production system.
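For readers who want to apply the same acceptance test, a "within 1%" check is a relative-difference comparison per metric. The sketch below assumes a 1% tolerance measured against the benchmark value. The metric names and counts are placeholders, not figures from this engagement.

```python
def within_tolerance(benchmark, observed, tolerance=0.01):
    """True when observed is within `tolerance` (relative) of benchmark."""
    if benchmark == 0:
        # Relative difference is undefined against zero; require an exact match.
        return observed == 0
    return abs(observed - benchmark) / abs(benchmark) <= tolerance


# Placeholder counts for illustration only.
benchmark_counts = {"customers": 1_000_000, "emailable": 400_000}
platform_counts = {"customers": 1_004_000, "emailable": 390_000}

results = {
    metric: within_tolerance(benchmark_counts[metric], platform_counts[metric])
    for metric in benchmark_counts
}
print(results)
```

In this made-up data, `customers` differs by 0.4% and passes, while `emailable` differs by 2.5% and would be flagged for investigation.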
WHAT THIS MEANS FOR OTHERS
The most important lesson from this project had nothing to do with Spark SQL versus Informatica. It was about what happens when the people doing the technical work also understand the business.
A team of five developers with deep Informatica experience built a system that failed under the weight of its own complexity. Not because they weren't skilled. Because the tool they knew wasn't the right fit for the problem, and no one could articulate why until the test failures made it impossible to ignore.
The two-person team that came in for mapping ended up delivering the entire engagement. That only happened because they were in every stakeholder meeting. They understood the GDPR rules, the sequencing logic, and the business reason behind every transformation. When the senior developer joined, that context transferred immediately: the developer brought the technical skills, and the business understanding was already in place.
There's a version of this pattern that applies directly to how teams are now using AI in data engineering work. If you're rebuilding or validating complex data logic, consider running two agents simultaneously toward the same objective: one building the tables using one approach, the other using a different tool or model entirely. Have a single developer reconcile the differences. The disagreements between outputs won't just surface bugs. They'll surface ambiguities in the business rules that thorough UAT wouldn't catch, because the tester and the developer share the same assumptions about what the output should look like.
Different teams, different tools, same goal. That's not redundancy. That's stress-testing your own logic.