Tools Of The Trade: Data-Cleansing Technologies
Data cleansing can require a variety of technologies. Here's a look at the top tool categories:
Data-profiling software. Products include software from profiling specialists Avellino Technologies Ltd. and Evoke Software Corp., as well as the more comprehensive data-quality vendors. They look at millions of records and report on standard errors such as blank fields and incorrect information in fields.
Cleaning software. The first level of data cleansing, these products eliminate duplicates and correct common errors such as incorrect city and ZIP code combos. Much of the cleaning is done through a combination of reference databases, such as master databases from the U.S. Postal Service, standard industrial classification codes, and stock ticker symbols. Often, the tools require multiple passes, or iterations, generally with some human interaction, to make them smarter.
Pattern matching. Pattern matching finds records that share criteria, such as an identical address, and applies changes to those records. The most common use of pattern matching is a task called householding, which involves combining individuals or companies that share common characteristics.
Augmentation or enrichment. Depending on how clean the data appears to be after profiling is done, augmentation can occur during or after cleaning and pattern matching. Examples of augmentation include adding data from new sources and including geographic information system (GIS) coordinates.
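To make the first three categories concrete, here is a minimal Python sketch of profiling, cleaning, and householding on a handful of records. It is an illustration only, not how any vendor's product works: the field names, the sample records, and the tiny ZIP_TO_CITY table (standing in for a real postal reference database) are all assumptions made up for this example.

```python
from collections import Counter, defaultdict

# Hypothetical reference table standing in for postal master data.
ZIP_TO_CITY = {"60601": "Chicago", "10001": "New York"}


def profile(records, fields):
    """Profiling: count blank or missing values per field."""
    blanks = Counter()
    for rec in records:
        for field in fields:
            if not (rec.get(field) or "").strip():
                blanks[field] += 1
    return dict(blanks)


def clean(records):
    """Cleaning: fix city/ZIP mismatches from the reference table,
    then drop records that are exact duplicates after correction."""
    seen, out = set(), []
    for rec in records:
        rec = {k: (v or "").strip() for k, v in rec.items()}
        city = ZIP_TO_CITY.get(rec.get("zip", ""))
        if city:  # only overwrite when the reference table knows the ZIP
            rec["city"] = city
        key = tuple(sorted((k, v.lower()) for k, v in rec.items()))
        if key not in seen:
            seen.add(key)
            out.append(rec)
    return out


def household(records):
    """Householding: group names that share a normalized address and ZIP."""
    groups = defaultdict(list)
    for rec in records:
        addr = " ".join(rec.get("address", "").lower().split())
        groups[(addr, rec.get("zip", ""))].append(rec.get("name", ""))
    return dict(groups)


# Made-up sample data: one misspelled city, one duplicate, one blank city.
records = [
    {"name": "Ann Lee", "address": "12 Oak St", "city": "Chicgo", "zip": "60601"},
    {"name": "Ann Lee", "address": "12 Oak St", "city": "Chicago", "zip": "60601"},
    {"name": "Bo Lee", "address": "12  oak st", "city": "", "zip": "60601"},
]

blank_counts = profile(records, ["name", "address", "city", "zip"])
cleaned = clean(records)
households = household(cleaned)
```

In this sketch the misspelled city is corrected from the ZIP lookup, which turns the first two records into exact duplicates so that one is dropped, and the two remaining people end up in a single household keyed on the normalized address. Real tools would rely on far larger reference data and fuzzier matching, typically over several passes with human review.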
Only a few vendors offer a complete suite of tools. The largest are Ascential Software, DataFlux (recently bought by SAS Institute), Firstlogic, Group 1 Software, and Trillium Software. Some companies, such as FedEx Corp., use tools from several vendors but separate them into specific strategic applications. Others, such as First Health, have consolidated to use tools from just one vendor.
One of the main reasons First Health, a health-care company, consolidated its software, says Bob Bularzik, assistant VP of software technologies, was that each vendor updates its reference data at different intervals, creating discrepancies when the records are compared.