Kimball University: Integration for Real People
These step-by-step guidelines will help dimension managers and users drill across disparate databases.
By Ralph Kimball
"Integration" is one of the older terms in data warehousing. Of course, almost all of us have a vague idea that integration means making disparate databases function together in a useful way. But as a topic, integration has taken on the same fuzzy aura as "meta data." We all know we need it; we don't have a clear idea of how to break it down into manageable pieces; and above all, we feel guilty because it is always on our list of responsibilities. Does integration mean that all parties across large organizations agree on every data element or only on some data elements?
This article decomposes the integration problem into actionable pieces, each with specific tasks. We'll create a centralized administration for all tasks, and we'll "publish" our integrated results out to a wide range of "consumers." These procedures are almost completely independent of whether you run a highly centralized shop on one physical machine, or whether you have dozens of data centers and hundreds of database servers. In all cases, the integration challenge is the same; you just have to decide how integrated you want to be.
Defining Integration
Fundamentally, integration means reaching agreement on the meaning of data from the perspective of two or more databases. Using the specific notion of "agreement," as described in this article, the results of two databases can be combined into a single data warehouse analysis. Without such an accord, the databases will remain isolated stovepipes that can't be linked in an application.
It's very helpful to separate the integration challenge into two parts: reaching agreement on labels and reaching agreement on measures. This separation, of course, mirrors the dimensional view of the world. Labels are normally textual, or text-like, and are either targets of constraints or are used as "row-headers" in query results, where they force grouping and on-the-fly summarization. In a pure dimensional design, labels always appear in dimensions. Measures, on the other hand, are normally numeric, and, as their name implies, are the result of an active measurement of the world at a point in time. Measures always appear in fact tables in dimensional designs. The distinction between labels and measures is very important for our task of integration because the steps we must perform are quite different. Taken together, reaching agreement on labels and on measures defines integration.
Integrating Labels
To get two databases to relate to each other, at least some of the labels in each database must have the same name and the same domain. For instance, if we have a notion of product "category" in both databases, we can use category as a point of integration if and only if the category field in the product dimension in the two databases has the same contents (as when drawn from the same domain). With this simple criterion, we are astonishingly close to achieving integration! If we issue separate but similar queries to the two databases, such as:
Database-1: select category, sum(private_measure1) from fact1 ... group by category
Database-2: select category, sum(private_measure2) from fact2 ... group by category
Then in virtually all BI tools we can sort-merge the two answer sets returned from the separate databases on the category "row header" to get a multirow results set with the column headers
category, sum_private_measure1, sum_private_measure2
It's crucial that you appreciate the importance of this simple result. We have successfully aligned measures from two separate databases on the same rows of a report and at the same level of granularity; because the category label has been carefully integrated across the two databases, at a certain level the report makes sense regardless of the value of measure1 and measure2. Even without integrating the measures across the two databases, we have achieved a powerful, valid integration result.
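The sort-merge on the conformed "category" row header can be sketched in a few lines of Python. This is an illustrative sketch only: the function, category values, and measures are hypothetical stand-ins for answer sets returned by two separate databases.

```python
# Hypothetical drill-across: merge two per-category answer sets
# returned by separate databases, aligning rows on the conformed
# "category" label. Data and names are illustrative only.

def drill_across(result1, result2):
    """Full outer sort-merge of two {category: measure} answer sets."""
    categories = sorted(set(result1) | set(result2))
    return [
        (cat, result1.get(cat), result2.get(cat))  # None = absent
        for cat in categories
    ]

# Answer set from Database-1: select category, sum(private_measure1) ...
revenue = {"Beverages": 120.0, "Snacks": 85.5}
# Answer set from Database-2: select category, sum(private_measure2) ...
shipments = {"Beverages": 40, "Produce": 12}

merged = drill_across(revenue, shipments)
# Each row: (category, sum_private_measure1, sum_private_measure2)
for row in merged:
    print(row)
```

Note that the merge never needed the two measures to agree with each other; it needed only the category label to be drawn from the same domain in both sources.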
This method of assembling results from two databases in a single results set, often called "drill across," has some other powerful advantages. Since the queries are launched to the two databases separately, the databases can be on separate data servers, or even hosted by separate database technologies. Both databases can manage performance independently. But even in the most centralized environment, we almost always have to execute this drill across against two independent fact tables. This approach has nothing to do with dimensional modeling; regardless of your modeling persuasion, it simply isn't possible to put multiple datasets of different granularity and dimensionality into a single table.
Integrating the labels of an enterprise is a huge and important step, but it's not easy and the main challenge isn't technical.
Getting the Enterprise to Agree
Using our product category example, there is bad news and good news in making sure that "category" has the same name and the same domain in multiple databases. The bad news is that as the data warehouse architect, you must convene a meeting of all stakeholders who would like to perform drill-across queries with the affected databases. You must reach 100 percent agreement among these stakeholders as to permissible category values and how these values are assigned to all of the products.
Before you dismiss this level of agreement as impossible, consider the good news that it's not absolutely necessary for any stakeholders to change any of their cherished product labels! If the group decides that category is a "conformed attribute" (a term for special labels that support integration), then the category label can be a newly introduced field that didn't exist in anyone's version of the product dimension. Such a narrow solution is, of course, disappointing, even if integration has been achieved, because category would be the only row header that could be used in drill-across queries! Far better that many labels could be found and administered to have the same names and domains across all the integrated databases (and in practice, this is more often the case). The majority of the important dimensions are filled with conformed attributes, but various stakeholders can insist on keeping some private labels. These nonconformed attributes can only be used as row headers or constraints on queries on their "home" databases.
Integrating Measures
After studying the drill-across query in the previous section, you might think that integrating measures between databases is not necessary. You would be right, with one big exception. You dare not do arithmetic combining the two numeric results (sum_private_measure1 and sum_private_measure2) unless you have first carefully vetted the business rules defining these two measures. For instance, if private_measure1 is month-end actual revenue after taxes and private_measure2 is rolling accrued revenue before taxes, then it is probably misleading to add, subtract or ratio these two quantities, even though the drill-across query yielded measures at the same granularity.
For disparate measures to be combined analytically, the data warehouse design team must identify all measures-to-be-conformed at the same time the labels are being negotiated. If the back-room ETL can modify the measures so they can be combined analytically, then they should be named in such a way to alert analysts that cross-database arithmetic makes sense. These specially vetted measures are conformed facts. If facts that are similar cannot be conformed, then they need to be named appropriately to warn the analysts.
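One way to make the vetting of business rules operational is to carry each measure's rules as metadata and refuse arithmetic when they disagree. The sketch below is an assumption about how such a check might look; the attribute names (period, tax basis) come from the revenue example above and are not a prescribed schema.

```python
# Sketch: tag each measure with its business rules so a BI layer can
# refuse cross-database arithmetic on non-conformed facts. The field
# names here are illustrative assumptions, not a standard.

from dataclasses import dataclass

@dataclass(frozen=True)
class MeasureSpec:
    name: str
    period: str      # e.g. "month_end" vs. "rolling"
    tax_basis: str   # e.g. "after_tax" vs. "before_tax"

def conformed(a: MeasureSpec, b: MeasureSpec) -> bool:
    """Two measures may be combined arithmetically only if their
    business rules agree on every dimension of meaning."""
    return a.period == b.period and a.tax_basis == b.tax_basis

m1 = MeasureSpec("private_measure1", "month_end", "after_tax")
m2 = MeasureSpec("private_measure2", "rolling", "before_tax")

assert not conformed(m1, m2)  # adding or ratioing them would mislead
```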
You now understand the full gamut of integration. Using dimensional terminology as shorthand, integration consists of conforming the dimensions and conforming the facts, or, to be more pragmatic, conforming enough of the labels and measures to support useful drill across.
Note that building a preintegrated physical database in the back room that avoids the drill-across approach adds absolutely nothing to the discussion. All you will have done is push these integration steps back into ETL processing.
Building a Master Dimension
Suppose that a large enterprise with many data sources commits to building an integrated view of its data. The crucial step is building a centralized set of conformed dimensions. This responsibility must be centralized because the definition, maintenance and publishing of these dimensions must be precisely coordinated down to the individual keys. A big organization will have many conformed dimensions. Theoretically, each conformed dimension could be administered independently and asynchronously, but it makes most sense if all the "dimension managers" work together and coordinate their efforts.
So what do dimension managers do? These individuals update and maintain the master dimension on behalf of the enterprise, and they periodically copy it into all of the destination database environments that use the dimension on one or more of their fact tables. Taking a step-by-step, technical view, the dimension manager:
1. Periodically adds fresh new records to the conformed dimension, generating new surrogate keys, but at the same time embedding the natural keys from the sources as ordinary fields in the dimension;
2. Adds new records for Type 2 changes to existing dimension entries (true physical changes at a point in time), generating new surrogate keys;
3. Modifies records in place for Type 1 changes (overwrites) and Type 3 changes (alternate realities), without changing the surrogate keys, and updates the version number of the dimension if either of these changes is made;
4. Supplies a dimension version number with the dimension reflecting minor (Type 2) and major (Types 1 and 3 and postdated entries) alterations;
5. Replicates the revised dimension simultaneously to all fact table providers.
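Steps 1 and 2, the heart of the dimension manager's key administration, can be sketched as follows. This is a minimal illustration, assuming an in-memory dimension; the class and field names are hypothetical, and a real implementation would live in the ETL pipeline.

```python
# Minimal sketch of a dimension manager's key assignment (steps 1-2).
# Class and field names are illustrative assumptions.

class DimensionManager:
    def __init__(self):
        self.rows = []       # the master dimension
        self.next_key = 1    # surrogate key sequence
        self.version = 1     # dimension version number (steps 3-4)

    def add_member(self, natural_key, attrs):
        """Step 1: a fresh source record gets a new surrogate key;
        the natural key is kept as an ordinary field."""
        row = {"surrogate_key": self.next_key,
               "natural_key": natural_key,
               "current": True, **attrs}
        self.next_key += 1
        self.rows.append(row)
        return row

    def type2_change(self, natural_key, new_attrs):
        """Step 2: a true physical change at a point in time closes
        the old row and adds a new row with a new surrogate key."""
        for row in self.rows:
            if row["natural_key"] == natural_key and row["current"]:
                row["current"] = False
        return self.add_member(natural_key, new_attrs)

mgr = DimensionManager()
mgr.add_member("SKU-42", {"category": "Beverages"})
mgr.type2_change("SKU-42", {"category": "Snacks"})
# Two rows now share natural key SKU-42 but carry distinct surrogate
# keys, so history is preserved when facts join through those keys.
```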
Some of the dimensional modeling vocabulary here may be unfamiliar to some readers. Surrogate keys and slowly changing dimensions of Type 1, 2 and 3 have been described extensively in this column and in Kimball Group books (see "Required Reading").
The fact table provider mentioned in step 5 is the consumer of the dimension. In tightly run, centralized shops, the dimension manager and the fact provider could be the same person. But that changes nothing. The dimension must still be replicated to the destination databases and attached to the fact tables. Furthermore, this replication should occur simultaneously across all such destinations. If it is not simultaneous, a drill-across query could return a well-structured result that is wrong if different definitions of "category" co-exist in different versions of the master product dimension. Thus, coordination among fact providers is essential.
Responsibilities of the Fact Provider
The fact providers have a rather complex task when they receive the updated dimension. They must:
1. Process dimension records marked as new and current by attaching them to current fact table records, replacing natural keys with surrogate keys (so, you now appreciate the significance of the natural keys being supplied by the dimension manager).
2. Process dimension records marked as new postdated entries. This requires adjusting existing foreign keys in the fact table.
3. Recalculate aggregates that have become invalidated. An existing, historical aggregate becomes invalidated only when a Type 1 or Type 3 change occurs on the attribute that is the target of the aggregation or if historical fact records have been modified in step 2. Changes to other attributes do not invalidate an aggregate. For instance, a change in the flavor attribute of a product does not invalidate aggregates based on the category attribute.
4. Bring updated fact and dimension tables online.
5. Inform end users that the database has been updated, notifying them of major changes including dimension version changes, postdated records being added and changes to historical aggregates.
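Step 1, the surrogate key substitution, is the workhorse of the fact provider's job and can be sketched as follows. The data structures are illustrative assumptions: incoming fact records arrive with natural keys, and the provider swaps in the surrogate key of the current dimension row.

```python
# Sketch of fact provider step 1: replacing natural keys in incoming
# fact records with surrogate keys from the replicated dimension.
# Field names are illustrative assumptions.

def key_lookup(dimension_rows):
    """Map natural key -> surrogate key, using current rows only."""
    return {r["natural_key"]: r["surrogate_key"]
            for r in dimension_rows if r["current"]}

def load_facts(incoming, lookup):
    """Attach each incoming fact to its current dimension row."""
    return [{"product_key": lookup[f["natural_key"]],
             "amount": f["amount"]}
            for f in incoming]

# Replicated dimension: SKU-42 has one historical and one current row.
dimension = [
    {"surrogate_key": 1, "natural_key": "SKU-42", "current": False},
    {"surrogate_key": 2, "natural_key": "SKU-42", "current": True},
]
facts = load_facts([{"natural_key": "SKU-42", "amount": 9.95}],
                   key_lookup(dimension))
# The new fact record points at surrogate key 2, the current row.
```

This is why the dimension manager must embed the natural keys as ordinary fields: without them, the fact provider has no way to build the lookup.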
The design of the ETL pipeline supporting these steps is carefully laid out in The Data Warehouse ETL Toolkit (see "Required Reading").
The version number supplied by the dimension manager is an important enforcement tool for ensuring correct drill-across applications. For example, if the version number is embedded as a constant field in every dimension record, this field can be added to the select list of queries in a drill-across app so the BI tool can't align the rows of results inappropriately.
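The version check itself is simple to enforce in a drill-across layer. The sketch below assumes each answer set carries the dimension version as an extra column; the function and column names are hypothetical.

```python
# Sketch of version enforcement in a drill-across merge. Each answer
# row is (category, dim_version, measure); names are assumptions.

def safe_drill_across(rows1, rows2):
    """Align rows on category only when both sources report the same
    dimension version; otherwise refuse to merge."""
    versions = {v for _, v, _ in rows1} | {v for _, v, _ in rows2}
    if len(versions) != 1:
        raise ValueError(f"dimension version mismatch: {versions}")
    m1 = {c: x for c, _, x in rows1}
    m2 = {c: x for c, _, x in rows2}
    return [(c, m1.get(c), m2.get(c))
            for c in sorted(set(m1) | set(m2))]

ok = safe_drill_across(
    [("Beverages", 7, 120.0)],   # both sources hold version 7
    [("Beverages", 7, 40)],
)

try:
    safe_drill_across([("Beverages", 7, 120.0)],
                      [("Beverages", 8, 40)])
except ValueError:
    pass  # versions differ; the merge correctly refused to align rows
```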
Summary
This article has described the data integration task in a specific, actionable way. Hopefully some of the mystery surrounding integration has been reduced and you can take this task off your to-do list so you don't have to feel guilty.
The vocabulary of dimensional modeling brings clarity and structure to the otherwise amorphous topic of integration. Key terms used in this article include:
* Conformed dimension. A master dimension for which contents have been agreed to by all parties in the enterprise.
* Conformed fact. A measure for which a business definition has been agreed to by all parties in the enterprise so the fact can be used in analytic computations across separate data sources and with other conformed facts.
* Surrogate key. A simple integer, usually assigned sequentially, serving as the primary key to a dimension.
* Drill across. A query strategy for producing integrated answer sets from multiple databases.
Ralph Kimball, founder of the Kimball Group, teaches dimensional data warehouse and ETL design through Kimball University and reviews large warehouses. He has four best-selling data warehousing books in print, including The Data Warehouse ETL Toolkit (Wiley, 2004). Write to him at [email protected].
REQUIRED READING
The Data Warehouse Toolkit, 2nd Edition, by R. Kimball and M. Ross (John Wiley & Sons, 2002).
The Data Warehouse ETL Toolkit, by R. Kimball and J. Caserta (John Wiley & Sons, 2004).
More than 100 free articles on dimensional modeling topics and conformed dimensions and conformed facts may be found at kimballgroup.com.