A data audit revealed €1.2M in annual revenue being misattributed across 3 disconnected databases with no shared key. Here's exactly what we found and how we fixed it.
Author
Pavel Siddique
Published
21 May 2026
Reading time
9 min read
Topics
data-engineering, data-platform, enterprise
The revenue was there. The customers were there. The transactions were recorded. But €1.2M per year was being optimised for the wrong acquisition channels because three databases — CRM, billing, and product — had no shared key linking them. Nobody had noticed because every individual system was working correctly. The problem only existed in the space between them.
When people hear "disconnected databases" they imagine broken systems or
missing data. The reality is more subtle and more dangerous. Each of the
three databases was functioning correctly. The CRM held customer
acquisition data. The billing system held revenue. The product database
held usage, cohort, and churn data. The problem was that there was no
reliable shared identifier linking all three — no single customer_id
that meant the same thing across systems.
The CRM used its own internal ID. Billing used a contract number. The product database used an email address as the primary key — which changed when customers updated their contact details, and sometimes duplicated when a single company had multiple seat-holders. When the analytics team tried to answer "which acquisition channel generates our highest-value customers," they were joining tables on best-guess fuzzy matches. The joins were technically valid. The conclusions were not.
The marketing team had been doubling down on a paid search channel for 14 months based on this data. The channel appeared to generate customers with 40% higher LTV. The true picture: that channel attracted customers in a specific industry segment who happened to be high-value. The acquisition channel itself wasn't the driver. The company had been misallocating marketing budget for over a year.
A data audit doesn't start with databases. It starts with questions the business is trying to answer and works backwards to whether the data actually supports those answers. Our first session with this client's CTO and Head of Data produced a list of eight business questions they relied on for strategy. We then asked: "Show us the query that answers each of these." Four of the eight either didn't have a query or had a query that depended on join logic nobody had documented.
The join between CRM and billing was done on company name, with LOWER() and TRIM() applied to normalise casing and whitespace. That works until a company rebrands, is acquired, or has a slightly different legal name in the two systems. We found 340 customer records where the join silently failed and returned NULL, which the analytics layer treated as "unattributed" and excluded from channel calculations. Those 340 records included the most valuable customers in the cohort.
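To make the failure mode concrete, here is a minimal sketch in plain Python rather than the client's actual SQL. The company names, channels, and amounts are invented for illustration, and `normalise` only mimics what LOWER() and TRIM() do.

```python
def normalise(name: str) -> str:
    """Mimic LOWER(TRIM(name)): strip outer whitespace and lowercase."""
    return name.strip().lower()

# Hypothetical records; names, channels, and amounts are invented.
crm = [
    {"company": "Acme AB ", "channel": "paid_search"},
    {"company": "Nordwind Logistics", "channel": "referral"},
]
billing = [
    {"company": "acme ab", "revenue": 12_000},
    {"company": "Nordwind Logistics AB", "revenue": 90_000},  # legal name differs
]

def revenue_by_channel(crm_rows, billing_rows):
    """Attribute billing revenue to CRM channels via the normalised name."""
    channel_by_name = {normalise(r["company"]): r["channel"] for r in crm_rows}
    totals = {}
    for row in billing_rows:
        # A failed lookup yields None, the equivalent of a LEFT JOIN NULL.
        channel = channel_by_name.get(normalise(row["company"]))
        totals[channel] = totals.get(channel, 0) + row["revenue"]
    return totals

totals = revenue_by_channel(crm, billing)
unattributed = totals.get(None, 0)  # revenue that no channel report will show
```

Casing and whitespace differences are handled, but the record with a different legal name falls through to `None`. If the reporting layer drops that bucket, the larger customer simply disappears from the channel comparison without any error being raised.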
The join between billing and the product database used email. Email changed for 18% of customers over a 24-month period — account migrations, domain changes, role changes. Each email change broke the historical join. We found five cohorts of customers who were treated as new customers in the product analytics but had 18+ months of billing history. The churn model was trained on this data. It consistently underestimated churn risk for a specific customer type because half of their historical signals were invisible to the model.
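The email problem can be sketched the same way. The example below is illustrative only: the addresses and dates are invented, and `first_invoice_for` is a hypothetical helper standing in for an email-keyed join between product and billing.

```python
from datetime import date

# Hypothetical billing history for one customer whose email changed.
billing_history = [
    {"email": "anna@oldco.example", "invoice_date": date(2024, 1, 15)},
    {"email": "anna@oldco.example", "invoice_date": date(2025, 3, 15)},
    {"email": "anna@newco.example", "invoice_date": date(2025, 9, 15)},
]

def first_invoice_for(email, invoices):
    """Earliest invoice date visible when joining on the given email."""
    dates = [inv["invoice_date"] for inv in invoices if inv["email"] == email]
    return min(dates) if dates else None

# The product database only knows the customer's current email.
current_email = "anna@newco.example"
apparent_start = first_invoice_for(current_email, billing_history)

# The real relationship started with the first invoice under any email.
true_start = min(inv["invoice_date"] for inv in billing_history)

# Days of billing history that the email-keyed join cannot see.
hidden_days = (apparent_start - true_start).days
```

Joined on the current email, this customer looks like a recent sign-up, while more than 18 months of billing history sits under the old address. A model trained on the joined view never sees those earlier signals.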
"Every system was accurate in isolation. The €1.2M problem didn't exist in any single database — it existed in the assumption that you could join them. That assumption was wrong." — Pavel Siddique, CEO, Indpro AB
The full audit produced 47 data quality issues across the three databases. Not all of them were revenue-critical. Triaging the 47 into actionable tiers took one working day. We scored each issue on two dimensions: revenue impact (how much money was at stake if left unfixed) and fix complexity (how long to resolve). The resulting 2×2 matrix was clear: three issues sat in the high-impact, low-complexity quadrant and were causing the €1.2M misattribution. Those three went first.
| Issue Tier | Count | Revenue Impact | Fix Complexity | Priority |
|---|---|---|---|---|
| Critical | 3 | €1.2M+/year | Medium (2–3 weeks) | Immediate |
| High | 9 | €50K–200K/year | Low–Medium | Sprint 2 |
| Medium | 18 | Operational impact | Medium | Backlog |
| Low | 17 | Reporting accuracy | Low | Hygiene |
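The two-dimension scoring can be sketched as a small triage function. This is an assumption-laden illustration: the cut-off values, the `Issue` fields, and the sample issues are hypothetical, since the audit's real thresholds are not stated here.

```python
from dataclasses import dataclass

@dataclass
class Issue:
    name: str
    revenue_impact_eur: float  # estimated annual revenue at stake
    fix_weeks: float           # estimated effort to resolve

# Hypothetical cut-offs; the audit's actual thresholds are not given.
IMPACT_CUTOFF_EUR = 200_000
COMPLEXITY_CUTOFF_WEEKS = 4

def quadrant(issue: Issue) -> str:
    """Place an issue in one cell of the impact/complexity 2x2."""
    impact = "high-impact" if issue.revenue_impact_eur >= IMPACT_CUTOFF_EUR else "low-impact"
    complexity = "low-complexity" if issue.fix_weeks <= COMPLEXITY_CUTOFF_WEEKS else "high-complexity"
    return f"{impact}/{complexity}"

def triage(issues):
    """Order issues by impact (largest first); cheaper fixes break ties."""
    return sorted(issues, key=lambda i: (-i.revenue_impact_eur, i.fix_weeks))

# Invented sample issues for illustration.
issues = [
    Issue("report rounding", 5_000, 1),
    Issue("name-based join", 1_200_000, 3),
    Issue("legacy schema drift", 300_000, 10),
]
ordered = [i.name for i in triage(issues)]
```

Scoring on estimated money and estimated effort, rather than technical severity, is what lets a 47-item list collapse into a short "do these first" set in a single day.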
The three critical issues all traced back to the same root cause: no canonical customer identifier agreed upon at system design time. Each database had been built independently over four years. The CRM was bought off-the-shelf. Billing was custom-built by a contractor. The product database was built by the in-house team. Nobody owned the cross-system data model — because nobody thought cross-system joining would be a core analytics requirement until it was.
The solution was not technically complex. The work was. We designed a
global_cust_id: a company-level identifier that lives outside all
three systems, is maintained in a lightweight reference table, and is
propagated to each system via a nightly sync job. Every new customer
record gets a global_cust_id at creation. Every historical record was
reconciled using a multi-pass matching algorithm: exact match on tax ID
(found in both CRM and billing), then fuzzy match on company name with
human review above 85% confidence, then manual resolution for the 40
records below that threshold.
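A rough sketch of that matching flow is shown below. It is not the production algorithm: the similarity measure (Python's standard `difflib.SequenceMatcher`), the field names `crm_id`, `contract_no`, `tax_id`, and `name`, and the sample rows are all assumptions made for illustration. Only the 85% review threshold comes from the description above.

```python
from difflib import SequenceMatcher

REVIEW_THRESHOLD = 0.85  # fuzzy matches at or above this go to human review

def similarity(a: str, b: str) -> float:
    """Name similarity in [0, 1] after basic normalisation."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def reconcile(crm_rows, billing_rows):
    """Return (exact, review, manual) buckets of CRM-to-billing matches."""
    exact, review, manual = [], [], []
    billing_by_tax = {b["tax_id"]: b for b in billing_rows if b.get("tax_id")}
    matched, remaining = set(), []

    # Pass 1: exact match on tax ID.
    for c in crm_rows:
        b = billing_by_tax.get(c.get("tax_id"))
        if b is not None:
            exact.append((c["crm_id"], b["contract_no"]))
            matched.add(b["contract_no"])
        else:
            remaining.append(c)

    # Pass 2: best fuzzy name match among still-unmatched billing rows.
    # (A simplification: one billing row may be proposed more than once.)
    candidates = [b for b in billing_rows if b["contract_no"] not in matched]
    for c in remaining:
        best = max(candidates, key=lambda b: similarity(c["name"], b["name"]), default=None)
        score = similarity(c["name"], best["name"]) if best else 0.0
        if score >= REVIEW_THRESHOLD:
            review.append((c["crm_id"], best["contract_no"], round(score, 2)))
        else:
            manual.append(c["crm_id"])  # Pass 3: resolved by hand
    return exact, review, manual

# Invented sample rows for illustration.
crm = [
    {"crm_id": 1, "tax_id": "SE123", "name": "Acme AB"},
    {"crm_id": 2, "tax_id": None, "name": "Nordwind Logistics"},
    {"crm_id": 3, "tax_id": None, "name": "Zeta Holdings"},
]
billing = [
    {"contract_no": "C-1", "tax_id": "SE123", "name": "ACME Aktiebolag"},
    {"contract_no": "C-2", "tax_id": None, "name": "Nordwind Logistics AB"},
    {"contract_no": "C-3", "tax_id": None, "name": "Kappa Group"},
]
exact, review, manual = reconcile(crm, billing)
```

The ordering matters: the immutable identifier is tried first, so the fuzzy pass only ever sees records with no better evidence, and nothing from the fuzzy pass is accepted without a person confirming it.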
The reconciliation took four weeks. Two engineers and one analyst. It was not glamorous work. It was also the most commercially valuable four weeks this team had spent in two years. Once the shared key existed, every join became exact, every cohort became accurate, and the churn model got retrained on clean data. The paid search channel that had looked like a high-LTV driver dropped to average. The channel they'd been under-investing in — direct and referral from a specific industry — was actually generating 60% of their highest-LTV customers.
The marketing team redirected 30% of paid search budget to referral programmes and direct outreach in the identified industry segment. That's the real value of fixing data: not the report that looks different. It's the decision that changes.
You don't need a €1.2M problem to make this worth examining. If your business runs on more than two operational databases that were built at different times by different teams, the question is not whether you have cross-system join issues — it's how severe they are. The fastest diagnostic is this: pick your most important business metric. Find the query that produces it. Count the join conditions. If any join uses a field that can change over the customer lifecycle (email, company name, phone number), you have exposure.
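As a starting point for that diagnostic, a short script can flag join conditions that touch mutable fields. This is a heuristic sketch, not a SQL parser: the regular expressions, the `MUTABLE_FIELDS` list, and the sample query are assumptions, and real queries with subqueries or unqualified column names would need something more robust.

```python
import re

# Column names treated as mutable over the customer lifecycle (assumed list).
MUTABLE_FIELDS = {"email", "company_name", "name", "phone", "phone_number"}

# Capture the text after each "JOIN ... ON" up to the next clause keyword.
JOIN_ON = re.compile(
    r"\bJOIN\b.*?\bON\b(.*?)(?=\bJOIN\b|\bWHERE\b|\bGROUP\b|\bORDER\b|$)",
    re.IGNORECASE | re.DOTALL,
)
# Qualified column references such as "b.email".
COLUMN = re.compile(r"\b\w+\.(\w+)\b")

def risky_joins(sql: str):
    """Return, per risky JOIN, the sorted mutable columns in its condition."""
    findings = []
    for match in JOIN_ON.finditer(sql):
        columns = {col.lower() for col in COLUMN.findall(match.group(1))}
        hits = sorted(columns & MUTABLE_FIELDS)
        if hits:
            findings.append(hits)
    return findings

# Invented example query in the spirit of the joins described earlier.
query = """
SELECT c.channel, SUM(b.amount)
FROM crm c
LEFT JOIN billing b ON LOWER(TRIM(c.company_name)) = LOWER(TRIM(b.company_name))
LEFT JOIN product p ON b.email = p.email
GROUP BY c.channel
"""
findings = risky_joins(query)
```

For the sample query, both joins are flagged: one on `company_name`, one on `email`. A join on a stable identifier produces no findings.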
A shared primary key — maintained in a reference table that predates all operational systems — is the correct fix. Building it retroactively takes 3–6 weeks depending on data volume and system access. It's not the kind of project that generates a ticket or gets prioritised in a roadmap sprint. Which is exactly why it rarely gets done until someone finds the €1.2M.
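For illustration, here is what a channel revenue query might look like once a shared key exists in each system. The schema is a simplified assumption built with Python's standard `sqlite3` module: the table layouts, the `G-001` style identifiers, and the amounts are invented. Only the `global_cust_id` name comes from the project described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Reference table: one row per company, owned outside the operational systems.
CREATE TABLE customer_ref (global_cust_id TEXT PRIMARY KEY, legal_name TEXT);

-- Each system keeps its own native key plus the propagated shared key.
CREATE TABLE crm (crm_id INTEGER, global_cust_id TEXT, channel TEXT);
CREATE TABLE billing (contract_no TEXT, global_cust_id TEXT, amount REAL);

INSERT INTO customer_ref VALUES ('G-001', 'Acme AB'), ('G-002', 'Nordwind Logistics AB');
INSERT INTO crm VALUES (1, 'G-001', 'paid_search'), (2, 'G-002', 'referral');
INSERT INTO billing VALUES ('C-1', 'G-001', 12000), ('C-2', 'G-002', 90000),
                           ('C-2', 'G-002', 10000);
""")

# Every join is now an exact equality on an identifier that never changes.
rows = conn.execute("""
    SELECT crm.channel, SUM(billing.amount)
    FROM customer_ref AS ref
    JOIN crm     ON crm.global_cust_id = ref.global_cust_id
    JOIN billing ON billing.global_cust_id = ref.global_cust_id
    GROUP BY crm.channel
    ORDER BY crm.channel
""").fetchall()
conn.close()
```

The query no longer depends on names or emails agreeing between systems, so a rebrand or an email change updates an attribute without breaking the link between a customer's acquisition record and its revenue.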
The diagnostic question: Pick your most important revenue metric. Find the query. If any JOIN uses a mutable field — email, name, phone — you have the same exposure. Fix the key, not the query.
A canonical identifier solves the cross-system join problem but doesn't fix data quality within each individual system. Of the 47 issues we found, 44 required additional work beyond the shared key project. Budget for both: typically 4–6 weeks for the key reconciliation, then a second phase for the remaining quality issues in priority order. Don't expect the first phase to solve everything — but do expect it to make everything else measurable and fixable.
The other limitation: this kind of audit requires access. Access to query all three systems, access to the people who built them, and access to the business stakeholders who can confirm what "correct" looks like for each metric. If your data team is siloed from the commercial team, the audit produces findings but not resolution. The fix is a cross-functional project, not a data engineering project.
How do you identify which data quality issues are revenue-critical vs. operational?
We start with business questions, not database tables. For each strategic metric the business relies on, we trace the query and identify every assumption in the join logic. Issues that affect a query used for resource allocation or investment decisions are automatically high-priority, regardless of their technical severity.
How long does a typical data audit take?
The audit itself — scoping, access, querying, issue identification, and triage — takes two to three weeks for a three-system environment. The reconciliation work that follows depends on data volume and system accessibility, but typically runs four to eight weeks. See our full breakdown in the 47-issue audit post.
Is €1.2M a typical finding, or was this an unusually large problem?
It's at the high end of what we typically find, but not unusual for a company with 3–5 years of independent system growth and no dedicated data engineering function. The €200K range is more common for smaller companies. The pattern — no shared key, joins on mutable fields, silent NULL exclusions — appears in roughly 70% of the companies we audit.

CEO & Co-Founder
Pavel founded Indpro in 2010 with a vision to bridge Nordic engineering culture with India's deep tech talent pool. Based in Stockholm, he oversees strategy and client relationships.