Data Quality DisciplineData Quality Discipline
Any single data element can quickly go from a simple transformation into an explosion of new requirements: Are you prepared?
We seem to be making the same mistakes over and over again with regard to data quality. In my July 10, 2004 column, "The Data Quality Audit," I outlined a six-step approach for conducting a data quality audit. But we also need to look at the bigger issues that interfere with effective data quality. Architects, project planners, customers, and vendors continue to misunderstand where, what, when, and how to address data quality.
I'll show how data quality and extract, transform, and load (ETL) are tied together by describing the transformation explosion and analytic thrashing that occurs with poorly understood source data. I'll also identify who's responsible for ensuring data quality: the vendor or the customer. And finally, I'll outline a few rules to help everyone avoid the setbacks that seem to explode when data quality problems surface.
Data Quality and ETL
ETL is still the most underestimated and underbudgeted part of most business intelligence and data warehouse iterations. And the biggest reason why the ETL portion of a project often raises more questions than it resolves is a lack of understanding of the source data, which creates a transformation explosion at each data element.
During data extraction and transformation, your team will encounter numerous issues, many of which will require your going back to the user group for further clarification:
Multiple meanings for the same data element. Once you've isolated the business requirement and started to gather the source requirements, you may find that a value defined by one user audience is different for other users who may be accessing the same data store. Some examples are values such as "profit," "cost," or "revenue." Each could have different meanings to different user communities. Consequently, you'll need to document each group's definition and determine whether you can implement transformation processes to satisfy each.
Multiple sources of data elements. Specific data elements often exist on multiple source systems. You must identify the various sources and discuss with the users which one is most applicable.
Differing levels of history. Another challenge you may encounter is availability of history. What if your business requirement calls for four years' worth of history but the best, most recent data contains only one year? The other three years would need to be extracted from other data sources, possibly of questionable quality.
Data cleanliness and accuracy. Warehouse data is never perfect. Instead, you need to ask yourself, "Is the data clean enough?"
Deintegration for audits and validation. So now that you've done a wonderful job at integrating, transforming, and cleansing the data, how do you separate the data for audit and validation purposes?
These descriptions shouldn't suggest a defined sequence of events. The message is that any single data element can quickly go from a simple transformation into an explosion of new requirements, irrelevant of the mix or order in which the issues are confronted.
And each uncovered data quality issue results in a significant amount of thrashing among the ETL team, project designers, and end users. (See Figure 1.) They must evaluate the new transformation issues before reaching a solution that can be sent back to the ETL programmers.
Figure 1: Analysis thrashing.
The project team designs a system to support a pretty pie chart. But if the project team and its designers never assess the quality of the source data necessary for that pie chart, then they're exposing the entire project to great risk. Consider this carefully: If no one spent any time assessing the source data quality, then you could purchase and install all the technology, do all the analysis, write all the source-to-target code to populate target tables, and still fail.
Who's Responsible?
Buyer beware is the best way to describe responsibility in this matter. It would be great if the vendors were constantly and explicitly handling data quality. It would be awesome to hear vendors tell clients, "Prior to your committing large resources to this effort, we want to conduct a data quality assessment of the source data specific to this project to ensure we can create the necessary target tables you need to meet your business requirements."
It doesn't matter if the client must pay a fee for this limited engagement. What matters is that this engagement mitigates the risk associated with sourcing data. It gives clarity to the task at hand and sets expectations for all involved. After all, if there's going to be any debate about how much the overall project will cost, how long it might take, and the risks involved, then that debate must be done now, at the beginning, before the big check is written and accepted.
Unfortunately, that isn't what typically happens. The client writes the big check before clearly understanding the quality of the source data and consequently, the size and scope of the effort. Still worse, the vendor anxiously accepts the check, equally in the dark about the true size and scope of the project.
Everyone is happy for the moment. Then the problems start to surface. The project planners and designers start receiving questions from their ETL jockeys — questions and issues that were never considered in the requirements gathering or addressed in the design. The worst time to uncover data quality problems is when your ETL programmers are hammering out code or during your user acceptance testing.
A Consistent Pattern
One of the world's largest consulting firms contracted me to assist with a few problem accounts. Projects were stalling, customers were becoming belligerent, and the consulting group was frustrated. At least from my perspective, the pattern was obvious and the connection clear. Data quality was being treated as an afterthought.
The project managers of this consulting group follow the traditional BI/DW approach. Business requirements are gathered using the infamous source-to-target mapping tool of excellence, Excel. A design is fashioned, and ETL programming is then initiated. And then the problems start to surface. The transformation explosion begins as described in the previous sections. Or worse yet, the problems surface after user acceptance testing begins. And nowhere in this big, best-of-class BI/DW approach is there explicit consideration for data quality assessment.
If the problems are small, the team puts in the extra effort to finish. If the problems are large and the effort to resolve them significant, sales reps make visits to the client to explain that source data quality issues will increase the time on the project and, of course, the cost. Now you have the client's attention. You mentioned money!
How the situation unfolds from here is predictable. We've all heard the arguments. There's plenty of blame to go around. The fact is that any client sufficiently naive to initiate a BI/DW effort without a clear understanding of source data quality and its impact on subsequent analysis and reporting deserves this problem. And vendors, knowing the risks heightened by source data quality, should be more responsible and help guide customers. This lack of consistent, conscious inclusion of a data quality assessment in any BI/DW iteration, either by the client or the vendor, is the biggest problem of data quality.
DQ Discipline Rules
When you think about it, is it even worth the risk of not doing a data quality assessment at the very beginning, as the first step of every project iteration? Figure 2 illustrates how a single data quality issue can grow into a significant, unexpected investment. The project expands, the company's costs spiral upward, and new software or adjustments to the project's objectives also might be required.
Figure 2: Data quality problem explosion.
Just imagine that you've told executives that this effort was going to cost x dollars, be completed by a certain date, and the company would have an analysis of a specific level. Now, only a few months into the effort, you need to go back to the executives and tell them you need more money, the project won't be completed until 60 days later, or the final analysis will be smaller than originally expected. Who needs this grief? Especially when it's avoidable.
Here are a few simple rules to follow that can help you avoid the problems that grow as a result of data quality issues.
Rule 1: Always perform a data quality assessment before committing to a project iteration. Whether the customer conducts its own assessment or contracts for a specific data quality assessment, it should be conducted so that any subsequent project can be accurately sized and scoped.
Rule 2: Focus on the audit and assessment, not technology. Once you mention data quality audit to the BI/DW team, the first thing asked is what software will be used. No, no, no. If you already have a data-profiling tool, you can use it for some of the audit/assessment. But the audit isn't just about profiling your data. If I were to identify the single most important result of an audit, I would have to say it is all about applying your business rules to your source data in order to make a specific target. The target feeds the data necessary for your user requirements (the pretty pie chart). So, if you can't take your current source data, apply your business rules, and make the required target, then you have unresolved data quality issues. End of story.
Rule 3: Run an audit as the very first part of any new BI/DW iteration. Any iteration. You must make sure you have the data necessary to achieve the task at hand. The scope of your assessment is determined by the user requirements being addressed in the iteration.
Avoid the Trap
Either the vendors lack the discipline to include data quality audits at the very start of every BI/DW iteration undertaken or customers naively believe that everything will be taken care of by the vendor. After all, that's what they paid for — or at least think they paid for. The client pays a mountain of money to a vendor and expects the vendor to get the job done. The vendor gladly takes the money and guarantees the implementation of hardware, software, data models, and processes to populate the models, but dodges the issue of data quality. Can you see the gap between what customers think they'll get and what vendors deliver?
And once you make data quality auditing/assessment cornerstone to your BI/DW iterations, avoid the technology trap! A data audit is just that, a data audit. It isn't a test of your ETL technology or skill. The reason I completely sidestep the technology is because virtually all teams have the propensity to make an audit into a technology event. And I for one do not want to waste my time or that of my client messing around with technology when I'm supposed to thoroughly assess the quality of the data with regard to the business requirements.
Michael L. Gonzales is the president of The Focus Group Ltd., a consulting firm specializing in data warehousing. He has written several books, including IBM Data Warehousing (Wiley, 2003). He speaks frequently at industry user conferences and conducts data warehouse courses internationally.
About the Author
You May Also Like