People often ask me which is the bigger problem: data quality or data quantity. I therefore decided to write a blog post on this topic.
From Lack Of Information To Information Explosion
What Changed?
The term ‘information explosion’ describes the rapid increase, experienced in recent years, in the amount of available information or data, and the resulting challenge of managing that information and making sense of it. Some may argue that we do not necessarily say or write more than people did in the past. But whereas in the past their words went unrecorded, nowadays we speak and write electronically. As a result, information can be replicated and distributed easily, making it available to much larger crowds than before. An online blog like Insights Unboxed is an example. Twenty years ago, if I wanted to reach a large audience, I needed to find a journal or a magazine (often a paper one) that would publish my article and distribute it to its subscribers. But because magazines had very limited space, publishing opportunities were scarce, and hence my ability to reach an audience was limited. The Internet has changed that.
Too Much Information?
We are generating huge amounts of data at an amazing rate. In the past, people were challenged by not having sufficient information (which gave rise to a service industry that profited from that scarcity). Nowadays the opposite is the case: the sheer amount of available data makes it difficult to identify what is truly important (amid the noise) and what is true versus what is “fake news”; not to mention how to analyze these larger amounts of data to create insights that are not straightforward: intelligence.
To conclude: the need for insight and information today is similar to what it was in the past; the key difference is that the challenge now lies in collecting, processing, and analyzing the data.
What Is Data Quality, And What Is It Not?
Data Quality: Myths
Some common myths about data quality, and why they are wrong:
- “Data quality is a matter of the IT department”. Wrong because while data is captured in IT systems, it’s the business that suffers from bad data and benefits from good data.
- “Master Data Management (MDM) is a matter of the IT department”. Wrong for the same reason as the previous one.
- “The problem is too big to solve”. Wrong because, as with many problems, some aspects have simple solutions while others require more complex ones.
- “Our data is enough; we do not need external data”. Wrong. I have yet to meet an organization that could not benefit from external data. Even organizations like tax agencies, which typically have vast amounts of data, still miss certain critical insights.
- “We can use data that is available on the Internet”. This works for certain goals, but not for others. You should always assess what your goal is and, subsequently, which data you need to achieve it. Only then can you assess whether data found on the Internet is suitable.
- “We know our customers well”. Some do. But I’ve seen so many organizations that do not really know their customers that well, and the problem starts with not having a helicopter view of your customers, because information about them is captured in siloed IT systems that do not communicate.
- “Data isn’t like a car; there is no reason why one supplier would charge a higher price than another for the same data”. Wrong. The mistake starts with the assumption that it’s “the same data”. For example, several weather forecast services may seem to deliver the same data (a weather forecast), but in fact one service may be based on meteorological data from two days ago, another on meteorological data from half an hour ago, and a third on statistical analysis of historical weather data, not on recent meteorological data at all.
Data Quality: Challenges
Organizations across all industries deal with the same recurring types of data quality challenges. For example:
- Data is not available (though it is required for making critical business decisions)
- Data is not available at the right time
- Incomplete data
- Inaccurate data
- Stale data that hasn’t been updated
- Data is available in inconsistent syntax or formats, making automated processing difficult or impossible
- Data is available in multiple IT systems, without standardization
- Redundancy: duplicate data
- Data was collected for a specific purpose; am I allowed to use it for another?
- … (the list is long, but the above is a good starting point; the sketch below makes a few of these concrete)
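To make a few of these tangible, here is a minimal sketch in Python that flags incomplete records, stale records, and likely duplicates in a toy dataset. All records, field names, and thresholds below are invented for illustration; real data quality tooling is considerably more sophisticated.

```python
from datetime import date

# Toy customer records illustrating several of the challenges above.
# All names, fields, and values are invented.
records = [
    {"name": "Acme B.V.",  "country": "NL", "phone": "+31 20 123 4567", "updated": date(2024, 3, 1)},
    {"name": "ACME BV",    "country": "nl", "phone": "0031201234567",   "updated": date(2019, 6, 12)},
    {"name": "Globex Inc", "country": "US", "phone": None,              "updated": date(2024, 1, 5)},
]

def normalize_name(name: str) -> str:
    # Crude normalization: lowercase, strip punctuation, drop legal-form suffixes.
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(w for w in cleaned.split() if w not in {"bv", "inc", "ltd", "gmbh"})

# 1. Incomplete data: required fields that are missing.
for r in records:
    if not r["phone"]:
        print(f"Incomplete record: {r['name']} has no phone number")

# 2. Stale data: not updated in, say, two years (the threshold is arbitrary).
for r in records:
    if (date.today() - r["updated"]).days > 2 * 365:
        print(f"Stale record: {r['name']} last updated {r['updated']}")

# 3. Redundancy: naive duplicate detection on the normalized name and country.
seen = {}
for r in records:
    key = (normalize_name(r["name"]), r["country"].upper())
    if key in seen:
        print(f"Possible duplicate: {r['name']!r} vs {seen[key]!r}")
    else:
        seen[key] = r["name"]
```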
Start With A Goal
Data is a means, not a goal. This is very important to remember, because it implies that one should not engage in data analytics “because everybody does”. One should engage because there is a specific business goal that can be realized through data analytics. Start by thinking about challenges (what does not go well) and opportunities (what do I want to achieve), and subsequently define the information needs for dealing with these challenges and opportunities. Now go search for this data! Next, you need to examine whether the data you have in mind is of sufficient quality. We will address several recurring problems and their solutions.
Recurring Problems, Recurring Solutions
Across all sectors and industries, most if not all of the organizations I have come across throughout my professional career can benefit from two specific solutions to two recurring problems.
Lack of Insights
Lack of Insights: The Problem
The first recurring problem is a lack of insight into the companies an organization works with: customers, suppliers, partners. This lack of insight results in increased risk exposure, leading to:
- Large write-offs (when a client defaults before having paid its bills) and supply chain disruptions (when a supplier defaults and is unable to deliver the products or services that you rely on);
- Reputational damage (when a company you work with turns out to be fraudulent, or to have bad press or a criminal history);
- Regulatory fines (when a company or person related to your customer appears on a sanctions list, and you should have known it).
Lack of Insights: The Solution
The solution: rather than trying to solve this challenge yourself (it is not your core business and therefore not your core strength), focus on what you’re good at and obtain this data externally. Companies such as Altares Dun & Bradstreet (BeNeLux website; U.S. website) specialize in delivering precisely these insights.
My Data Is A Mess
My Data Is A Mess: The Problem
A second recurring problem is having chaos in your own IT systems. A bit embarrassing? Not really, because so many organizations deal with the same challenge. You’re not alone; so many others have redundancy in their IT systems, or obsolete data, or incorrect data, or…. I once spoke to a client who told me that they had 50 million company records across multiple ERP systems, yet expected to have only roughly 4 million unique companies. They were unable to solve the problem because there was no way to detect that two records referred to the same company. Records for a single company often had somewhat different names or different addresses (e.g. PO Box vs. mailing address vs. visiting address), and some companies had changed names or even legal structures due to mergers and acquisitions. Now, I must say: not many organizations have databases with 50 million records. But the interesting point is that whether you have 1,000 records or 50 million, you’re dealing with the same problem, and the solution may be the same (with some technical differences in the implementation).
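To see why this is hard, consider the naive approach: comparing every record to every other. With n records that is n*(n-1)/2 comparisons, roughly 1.25e15 pairs at 50 million records. A minimal Python sketch, using only the standard library and invented company names, shows both the approach and its limits:

```python
from difflib import SequenceMatcher
from itertools import combinations

# Invented example: three spellings that may all refer to one company.
names = ["Acme Holding B.V.", "ACME Holding", "Acme Holdings Ltd."]

def similarity(a: str, b: str) -> float:
    # Ratio of matching characters between the two lowercased names (0..1).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Pairwise comparison: n records require n*(n-1)/2 comparisons.
# At 50 million records that is ~1.25e15 pairs, which is computationally
# hopeless; and the 0.8 threshold below is an arbitrary guess that still
# mishandles renames, M&A, and PO-Box-vs-visiting-address differences.
for a, b in combinations(names, 2):
    score = similarity(a, b)
    verdict = "possible match" if score > 0.8 else "no match"
    print(f"{a!r} vs {b!r}: {score:.2f} ({verdict})")
```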
My Data Is A Mess: The Solution
The solution: large organizations may try to solve this problem by deploying Entity Analytics software. While this is a good solution, it is hard to implement, and the time-to-value is likely to be long. The solution that still amazes me (again and again) in its simplicity and speed is Referential Matching, a patent-supported method by Dun & Bradstreet. At its core it’s very simple: since D&B knows the names (current name, previous name, trade name), the addresses (official address, PO Box, mailing address, …), phone numbers, etc. of companies, it can compare each entry in your database to its huge database. Thus, instead of comparing entries in your database to each other (as Entity Analytics software does), Dun & Bradstreet compares them to its own database, which serves as a reference. Once two (or more) entries match the same entry in the D&B database, they refer to the same entity, and you have found redundancy (some extra benefits come as a “side effect”, e.g. detecting entities that have already gone out of business). At this point you add to each of these entries in your database(s) the D&B DUNS Number, the unique identifier of an entity, and this DUNS Number serves as the linking pin in your systems, allowing you to identify all the entries for a single entity. That’s how you create a helicopter view of what you know about a customer or a supplier.
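To illustrate the idea only (this is not D&B’s actual algorithm, data, or API), here is a toy Python sketch: each of your records is normalized and looked up in a reference table, and matched records receive the reference identifier, which then acts as the linking pin. The reference table, the normalization, and the identifiers are all invented for illustration.

```python
# Toy illustration of referential matching. NOT D&B's actual method or API;
# all data, names, and identifiers here are invented.
REFERENCE = {
    # normalized name -> unique identifier (standing in for a DUNS Number)
    "acme holding": "123456789",
    "globex": "987654321",
}

LEGAL_SUFFIXES = {"bv", "ltd", "inc", "gmbh"}

def normalize(name: str) -> str:
    # Crude normalization: lowercase, strip punctuation, drop legal suffixes.
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(w for w in cleaned.split() if w not in LEGAL_SUFFIXES)

# Your own (messy) records: duplicates hiding under different spellings.
my_records = ["Acme Holding B.V.", "ACME HOLDING", "Acme Holding Ltd.", "Initech"]

# Match each record against the reference, not against each other.
by_id = {}
for rec in my_records:
    ref_id = REFERENCE.get(normalize(rec))  # real matching is far more robust
    if ref_id is None:
        print(f"No reference match for {rec!r} (needs manual review)")
    else:
        by_id.setdefault(ref_id, []).append(rec)

# Records sharing an identifier refer to the same entity: redundancy found.
for ref_id, recs in by_id.items():
    if len(recs) > 1:
        print(f"ID {ref_id}: {len(recs)} records for one entity -> {recs}")
```

Note that the work here is linear in the size of your own database: each record is one lookup against the reference. That is exactly why this approach scales where the pairwise matching sketched earlier does not.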
Conclusion
Data quantity is a problem of the past. Today’s challenge is clearly data quality. In this blog I listed several aspects of the data quality challenge. These apply to every industry: Finance, Logistics, Manufacturing, Government (public sector), Professional Services, Travel & Transportation, and any other type of organization (we’re all unique, but also similar). I have addressed two key challenges in particular, and their solutions. The first challenge is not having the right data (or not having it on time, or having it with insufficient quality). The solution is to obtain the data from external sources: just as you buy other services that aren’t your core capability, data can also be obtained externally, from a company whose core business is collecting, processing, analyzing, and selling high-quality data. The second challenge is not being able to identify clearly who you do business with, in a way that avoids redundancy and inconsistencies. The solution is, again, to seek help externally, and to use a unique identifier maintained by a provider whose core business is keeping this data accurate, i.e. monitoring and updating it daily.
The Million Dollar Question
Are you asking yourself how long the implementation of such initiatives would take? (Almost) nobody likes long and complex projects. I had the same reservations until, several years ago, I learned about Referential Matching. Implementation time can be anything between a few weeks and several months, not more. So what are you waiting for?
Recommended Reading: