Set Disruptors ON FULL
New information integration technologies, with help from Moore's Law and increasing standards acceptance, are becoming today's disruptive technologies. Data warehouse incumbents are in for a surprise if they don't pay attention.
Andy Hayler of Kalido, in an essay published at Intelligent Enterprise's Web site ("EII: Dead on Arrival"), argued that enterprise information integration (EII) was a half-baked idea that ran roughshod over the essential disciplines of data warehousing. Hayler felt that benefits claimed by EII solutions, which advocate a federated rather than centralized approach to data warehousing, were ill-founded and poorly conceived (my words, not his). Specifically, he argued, EII does not address historical data, data quality, and the performance of queries against systems not optimized for decision support.
Although I tend to agree with Andy on these points, not all EII vendors are alike. There's more to the story than just EII. Unfortunately, his arguments sound like the classic claims of an incumbent challenged by a disruptive technology. The irony is that Kalido has been a disruptive force itself with regard to traditional practices of designing and maintaining a data warehouse. However, data warehousing itself is now facing disruption.
Over the past decade and more, each major data warehousing component has had its 15 minutes of fame. In the beginning, the attention was focused on databases and database servers, particularly on their ability to scale to large volumes of batch updates and heavy query processing. Following closely behind were extract, transform, and load (ETL) tools, and — ever so briefly in the limelight — metadata. Seven or eight years ago, data modeling grabbed the spotlight, only to be upstaged by business intelligence (BI) tools.
This cavalcade of technology stardom was in fact just a shifting of focus. Far from disappearing, the components just blended into the background, their roles secure. Today, vendors are consolidating their portfolios to put as many of these components as they can under a single brand name. Could they, however, be missing a larger trend? Fundamental advances in information integration are already disrupting the value proposition of most of these technologies.
Rationalizing data across an enterprise is a problem so hard that it's only been attacked piecemeal. However, the stars are aligning for real progress. First, Moore's Law continues to deliver unbelievable hardware resources to power solutions at increasingly affordable prices. Second, the development of the Internet has put us on the road to universal connection and access. Third, e-business — alive and well although somewhat less glamorous — has fueled the development of business integration technologies, which reside primarily in the enterprise application integration (EAI) technology bucket.
Fourth, when data warehousing gave rise to ETL tools, it put data quality and metadata issues on the front burner. Today's obsession with governance, disclosure, and regulatory compliance is accentuating the demand to solve these issues — but the requirements are driving a change toward real-time reporting supported by EII.
Finally, other stars are aligning: Web services, service-oriented architecture, and evolving information-sharing standards based on XML, the Semantic Web, Resource Description Framework (RDF), and ontology, along with what the newer and major enterprise application vendors are doing to take advantage of these standards. Such steps include, first, turning their application suites into modular components that come together through integration infrastructure; and second, introducing tools and methodologies for creating agile business processes, especially through employing the standard Business Process Modeling Language (BPML). The conclusion is that many existing approaches to information integration urgently require rethinking.
Pursuit of the Truth
Data warehouses extract data from nonconformed systems and perform excruciating data cleansing. In many cases, the errors found are corrected in the data warehouse but never in the feeder systems themselves. Thus, a Sisyphean task is repeated endlessly as new, equally dirty data sources are added.
The cost of populating a new data warehouse design combined with the cost of maintaining the warehouse accounts for a substantial proportion of IT budgets. Because systems tend to be acquired or developed at different times and for different constituencies, inconsistencies introduced make integration even more difficult. If not done properly, integration becomes the conduit for erroneous information that's ultimately reported and acted upon.
To deal with the problem of bad information, organizations are on a quest for the "single version of the truth" (SVOT). Interestingly, data warehouse promoters signed up for the SVOT quest as a way of justifying their rather expensive projects. Without ever achieving the goal, many now use SVOT as a justification for the data warehouse: a bit of a non sequitur, in my view. While a well-designed data warehouse may be able to present a single version of the data, "truth" is more elusive; it arises from much more than just the back end of data warehousing. Business rules, models, metrics, presentations, and interpretations are components of the "truth," and these normally are beyond the scope of a relational database and its ETL processes. In addition, most organizations have multiple data warehouses, a reality that raises the question: How many SVOTs can there be?
The heart of a data warehouse is the data model. In most data warehouse methodologies, the first piece of design work is to construct a logical model, which eventually gets transformed into a physical model of tables, attributes, keys, indexes, views, and other database objects. Data modelers form the model out of their careful observation of the "requirements" of the system. In other words, the process starts with a container. While a well-designed data warehouse (and as we all know, too many aren't well designed) exhibits a great deal of flexibility in how it may be applied to business problems, most modeling processes are as fluid as concrete: Once set, they can't be altered without a jackhammer. (Since I mentioned Kalido earlier, I'd like to point out that this is precisely the problem the company's products have addressed quite well.)
Models may be extended, but usually what exists can't be changed without the following:
A logical model redesign
A physical model redesign
ETL routine modifications
Modification of all affected views, queries, and extracts
Absorbing the potential failure of downstream unauthorized applications, such as spreadsheets and personal databases
Reloading the data
Extensive testing.
In short, quick changes to a data warehouse are measured in months, not days, hours, or minutes. Can you say backlog?
The root cause of this problem is that the modeling process is presumed complete at the beginning of each development phase. It isn't open-ended and can't cope with changes gracefully. The physical model becomes the reference model; all back-end and front-end routines address the physical model directly. Although there's metadata all over a data warehouse, most of it is passive and used primarily by one tool for its own purposes; even then, when there are changes, the administrator of the metadata tools must remap the metadata to the physical layers of the database.
At a functional level, data warehousing didn't achieve the elusive "closing the loop" for two reasons: most systems lack a bidirectional data flow, and the warehouse's cumbersome load processes are an impediment to real-time data. EAI and EII have better stories to tell.
The EAI Story
Just a few short years ago, at least at the operational level, it seemed like data integration was going to be a solved problem. Many organizations chose to adopt a single reference model provided by an ERP applications vendor. It was a good news/bad news situation: While everyone applauded the potential end of data fragmentation, few organizations were comfortable with the locked-in notion that the ERP model was the right one for every situation.
Then, organizations discovered how difficult it was to integrate third-party software with ERP systems and develop and maintain that software through ERP version upgrades. EAI came to the rescue by providing standard "connectors" between different packaged systems. EAI provided a convenient, but not simple, method of connecting two systems at the programmatic level.
Naturally, when the systems to be connected weren't off-the-shelf, organizations used connector kits. This practice points to a drawback of EAI: The solutions still required a fair amount of programming. Plus, EAI didn't provide much data integration because the software operated at the business-process level and supported primarily transaction processing. And for many organizations, EAI's ephemeral style of integrating instantaneously but not persistently was too confining for the amount of effort and the cost. To the extent that EAI and the connected ERP applications enforced standardized semantics (usually referred to as "canonical representation"), the tools didn't expose metadata in a way that other tools could consume easily.
In fairness, most EAI vendors have expanded their offerings and leveraged the tools and knowledge for broader usage. However, other integration vendors have also expanded their offerings, blurring the distinctions between ETL, EAI, and EII.
The EII Story
Previous articles in this magazine have covered the features of EII in depth. But, to summarize, EII provides access for report writers and other BI tools to the most current data in all of the systems both behind and beyond the firewall. Tool capabilities range from very simple to truly amazing; however, for the most part, EII works by allowing for a common definition or view of the various systems, some form of query federation and optimization, and in a few cases, caching.
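To make the federation idea concrete, here is a minimal sketch in Python (my own illustration, not any vendor's implementation). It assumes two hypothetical sources, an order system and a CRM, held in in-memory SQLite databases; the common view maps their differing column names into one vocabulary and answers a question by querying both sources live rather than copying data into a warehouse.

```python
# A minimal sketch of query federation: one common view over two sources
# that never agreed on column names. The "sources" here are hypothetical
# in-memory SQLite databases standing in for live operational systems.
import sqlite3

def load_source(ddl, insert_sql, rows):
    """Create a stand-in operational system and seed it with a few rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn

# Source 1: an order system.
orders = load_source(
    "CREATE TABLE orders (cust_id TEXT, amount REAL)",
    "INSERT INTO orders VALUES (?, ?)",
    [("C1", 250.0), ("C2", 75.0), ("C1", 100.0)],
)
# Source 2: a CRM with its own idea of a customer key.
crm = load_source(
    "CREATE TABLE crm (customer_key TEXT, region TEXT)",
    "INSERT INTO crm VALUES (?, ?)",
    [("C1", "EMEA"), ("C2", "APAC")],
)

def revenue_by_region():
    """The federated 'view': query each source live and join in the middle."""
    spend = dict(orders.execute(
        "SELECT cust_id, SUM(amount) FROM orders GROUP BY cust_id"))
    totals = {}
    for key, region in crm.execute("SELECT customer_key, region FROM crm"):
        totals[region] = totals.get(region, 0.0) + spend.get(key, 0.0)
    return totals

print(revenue_by_region())  # {'EMEA': 350.0, 'APAC': 75.0}
```

A real EII product layers query pushdown, optimization, and caching on top of this pattern, but the shape is the same: the mapping, not the data, is what gets centralized.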
The EII field is confusing in how it deals with various types of data, such as structured (classic database and other well-defined formats), semi-structured (flat files, spreadsheets), and unstructured (everything else). No tool can integrate data unless it can bring structure to that which is unstructured because, as we know, computers just aren't that smart (yet). In other words, mining unstructured data may sound pretty good, but a huge amount of work precedes the actual mining.
EII is going to force data warehousing out of its comfort zone by pushing the envelope with respect to what can be done in today's environment with closing the loop, federating queries, expanding the breadth of information available, and tackling latency. This is EII's disruptive potential; it forces us to question our restrictive view of what data warehousing can and should do.
EII, of course, has to prove its mettle. Is it reasonable to assume that you can point a tool at a few databases and effectively and accurately answer complex queries, without a data warehouse? Some EII tools are clearly too lightweight to accomplish such a difficult objective. Others, however, are showing robust capabilities to handle integration, but in a way that's alien to data warehouse modelers.
What's New?
Victor Hugo once said "There is nothing more powerful than an idea whose time has come." Fortunately for the field of information integration, there's an emergent idea whose time has come. For the first time since computers were applied to business processing, market forces are driving a convergence of information and application architecture around a set of common standards. This convergence is the enabler of an entirely new way of conceiving of and managing information in organizations: a level of abstraction that allows people and processes to work with the meaning of data, not the data itself.
Our industry has not yet settled on the one buzzword for this idea. Instead, we have a few: semantic information models, ontology, information fabric, and that old favorite, metadata. All of these terms are deficient in capturing the depth and breadth of the opportunity before us. In plain English, the idea is that modeling the meaning of the data is the most important thing to do but not in the same way we build data models now. Our modeling techniques are too tightly coupled with applications and too specific to a given domain. They are also too tightly aligned with a particular data management approach, such as a relational database. Because of this, we end up with silos and have to reintegrate the data after the fact to perform meaningful analysis. Newer approaches represented by EII use a declarative approach, where each additional piece of information expands (or restricts) the model without causing a cascade of structural changes.
Declarative models, in language that makes sense to businesspeople, have the advantage of being approachable and accretive. For example, a salesperson may say, "A customer is an economic_unit that has money in the bank." The credit department may modify that for their purposes with, "...who has placed an order." Of course, the syntax is a little different; as with any emerging technology, the advisors, wizards, and code generators come later.
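The syntax below is purely hypothetical, a rough Python sketch of what "accretive" means in practice: the credit department's rule layers on top of the sales definition without restructuring anything that already exists.

```python
# Hypothetical illustration of accretive, declarative definitions.
# Sales: "A customer is an economic_unit that has money in the bank."
def is_customer_sales(party):
    return party.get("type") == "economic_unit" and party.get("bank_balance", 0) > 0

# Credit: "...who has placed an order." A refinement, not a schema change.
def is_customer_credit(party):
    return is_customer_sales(party) and party.get("orders_placed", 0) > 0

acme = {"type": "economic_unit", "bank_balance": 1200, "orders_placed": 0}
print(is_customer_sales(acme), is_customer_credit(acme))  # True False
```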
Agreeing on metamodels first, then designing or retrofitting applications to conform to them, flips traditional methods 180 degrees and spares organizations some very costly steps. I believe it represents the beginning of a long-term solution to the problem of integrating dirty data.
I should be careful not to underestimate the difficulty of developing common metamodels. However, if the effort sounds reminiscent of what organizations faced in developing an enterprise data model, relax. The language for defining metamodels is one that business owners understand. Also, organizations don't have to develop metamodels from scratch. Industry metamodels are becoming available: examples include the insurance industry's ACORD and FinXML for capital markets. Others include Investment Research Markup Language (IRML), Extensible Financial Reporting Markup Language (XFRML), Extensible Business Reporting Language (XBRL), and Vendor Reporting Extensible Markup Language (VRXML).
These metamodels, plus hundreds if not thousands of others, are facilitated, of course, by the emergence and acceptance of Web services and XML as universal standards. But that's not the whole story: Web services deal only with the interfaces. Bidirectional metadata, model-based EII, and semantic information models (ontology) provide a rationalized vocabulary. Ontology describes the meanings and relationships of things, and RDF is the language used to build an ontology. It's perfectly acceptable to have many meanings for the same thing, provided the systems can sort out the meanings through the use of schemas and ontology.
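As a small, hedged example of what that looks like in practice, the following sketch revisits the salesperson/credit-department example, this time expressed as RDF using the third-party rdflib library in Python. The namespaces and class names are invented for illustration, not drawn from any published ontology.

```python
# A toy RDF ontology: two meanings of "Customer," kept distinct by namespace
# (schema) but related to each other, so systems can sort the meanings out.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

SALES = Namespace("http://example.org/sales#")    # hypothetical vocabularies
CREDIT = Namespace("http://example.org/credit#")

g = Graph()
g.bind("sales", SALES)
g.bind("credit", CREDIT)

# Sales: a customer is an economic unit with money in the bank.
g.add((SALES.Customer, RDF.type, RDFS.Class))
g.add((SALES.Customer, RDFS.subClassOf, SALES.EconomicUnit))
g.add((SALES.Customer, RDFS.comment,
       Literal("An economic unit that has money in the bank")))

# Credit: a narrower meaning that builds on, rather than replaces, the first.
g.add((CREDIT.Customer, RDF.type, RDFS.Class))
g.add((CREDIT.Customer, RDFS.subClassOf, SALES.Customer))
g.add((CREDIT.Customer, RDFS.comment,
       Literal("A sales customer who has also placed an order")))

print(g.serialize(format="turtle"))
```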
The beauty of metamodel approaches — and of the declarative approach in general — is that the job becomes incremental. Adding new material doesn't require scrapping the old. The evolution of these models is a perfect match for the way in which businesspeople continuously learn and refine knowledge about their businesses.
Isn't this a little more like how real organizations work? Instead of calling three different definitions of "customer" a data quality problem and trying to reduce it to a SVOT, wouldn't it make more sense to find a way to celebrate the differences? The key is to do this in a way that isn't confusing or mistaken. This is the promise of the Semantic Web: It challenges everything we currently do with data integration.
Beneficial Change
So, where do you start? Initially, choose focused pain points with a quick return on investment, such as regulatory and statutory compliance, merger and acquisition activity, strategic integration efforts, and BI priorities. You can create a core information model and expand it over time, but it's best to start with an industry standard, such as ACORD. Then, as you reuse existing models and reverse engineer schemas, keep an eye on value. You can find this value by exposing the benefits of a semantic information model through your information management portal and by tracking business returns from business agility and information quality.
That's the roadmap. There's a lot more to this story; however, I hope I've made it clear that the foundation upon which we've built our integration architecture is getting wobbly. Does it portend the end of data warehousing? No. But in some ways, the data warehouse will recede in importance, providing certain functionality but losing control of other things. What must change is summed up in what Joshua Greenbaum put so eloquently in a recent column in this magazine: "The more you process information, the further you get from the truth."
I would like to thank those who generously shared their time discussing the concepts in this article, including Jeff Morris, Actuate; Andrew Marby, Ascential Software; Naveen Gupta, BEA; Jon Rubin, IBM; Harriet Fryman, Informatica; Jake Freivald, iWay Software; Joe Chappel and Philippe Chambadal, Metamatix; Doug Chope, MicroStrategy; Lothar Shubert and Roman Bukary, SAP; and Andy Astor, WebMethods.
Neil Raden is a consultant and the founder of Hired Brains Research, an industry research and analysis firm that follows the data warehousing, business intelligence, and information integration industries. He welcomes your comments.
RESOURCES
Christensen, C. The Innovator's Dilemma, Harvard Business School Press, 1997
Hayler, A. "EII: Dead on Arrival," July 19, 2004: www.intelligententerprise.com/showArticle.jhtml?articleID=23901932
Matthews, T. "A New View on Intelligence," July 19, 2004: www.intelligentintegration.net/showArticle.jhtml?articleID=23901013
Greenbaum, J. "The Truth About Truth," Sept. 18, 2004: www.intelligententerprise.com/showArticle.jhtml?articleID=46800502
For a primer on RDF, see the W3C site www.w3.org/TR/rdf-primer