Consumer and Enterprise Search: Not an Exact Match

Overshadowed by Google mania and the comparatively simple technology of consumer search engines, enterprise search is making exciting progress. But the requirements are tougher: businesses demand accuracy, timeliness, integration, and availability.

Seth Grimes, Contributor

May 18, 2004

12 Min Read

Thus the orb he roamed
With narrow search, and with inspection deep
Considered every creature.
- John Milton, Paradise Lost

If we view computing as the discipline of automating information management, then search is the computing world's second oldest profession. Surely no software developer embarking on a new project, immersed in use cases and design patterns, ever thought, "the users are going to find my data organization impenetrable so I'd better put a search box on the main screen." The designers do their best; the boss hires usability consultants; but all the same, a search function ends up occupying prime real estate on the Windows Start menu, on the main page of ibm.com, and as a key toolbar feature of Adobe's Acrobat Reader. Users will frequently bypass all the carefully defined hierarchical menus and navigation categories and go straight to search.

Search is a necessary evil, providing shortcuts to documents we know exist but don't know how to reach. Search is also a means of discovering new sources of information on subjects of interest. These two needs are common to consumer Web use and within enterprise environments. But despite all the attention paid to Web search — to Google, Yahoo, and Microsoft's incipient re-entry into the field — and despite enterprise adoption (and adaptation) of Web-search engines, particularly in portals, there's much more to enterprise search than simply presenting ranked lists of hits generated by matching keywords in massive Web-site-content hoards.

Parallel Universes

The differences between enterprise and consumer Web search approaches begin with architecture. Businesses worry about information security, accountability, and data integration, along with interoperability among the varied systems that support business operations. These points aren't important in consumer search, where the big issues are an engine's reach and the relevance and comprehensiveness of the results returned.

The operating environments of the two classes of search also differ. The Web is free of mandated standards beyond the basic languages and protocols that link servers and documents. The low cost of publishing to the Internet — and of being recorded in search-engine indexes — led to the Web's explosive growth to the point where it is now virtually boundless in subject-matter coverage. The corollary, of course, is that information quality and timeliness vary hugely.

For enterprises, the boundaries between corporate space and the public Internet are increasingly blurred. Nonetheless, every company maintains computing systems with proprietary and confidential information that isn't (or at least shouldn't be) exposed on the Internet. The cost of producing, managing, and disseminating information is high for modern enterprises because efficiency and profitability depend on accuracy, timeliness, and availability. This high cost constrains an enterprise to concentrate on information that's directly related to operations, justified by business process needs.

In exchange for ease of use and low cost, consumer Web users accept questionable quality — potentially low accuracy, incomplete subject-area coverage, dubious information value — and the simplest results presentation. You type a few words into a text box and get a list of hits; all that's asked of you is your willingness to view paid placements and other forms of advertising. Most of us are happy with that. Commentators frequently assert that only a small percentage of users attempt so-called advanced search, which lets you compose query expressions with Boolean logic and specify file type, source, and other details. Furthermore, given the variety of uses that may potentially be made of search results, tailoring the presentation of results for any particular use might make other applications more difficult.
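To make the distinction concrete, here's a minimal sketch, in Python over an invented three-document collection, of what evaluating such an advanced query amounts to: required words, excluded words, and a file-type restriction applied together. Real engines work from indexes rather than scanning every document, but the logic is the same.

```python
# Minimal sketch of "advanced" Boolean search over a toy document set.
# The documents and fields below are invented for illustration only.
docs = {
    1: {"type": "pdf",  "words": {"enterprise", "search", "security"}},
    2: {"type": "html", "words": {"consumer", "search", "advertising"}},
    3: {"type": "pdf",  "words": {"enterprise", "portal", "integration"}},
}

def boolean_search(required, excluded, file_type=None):
    """Return IDs of documents containing all required words,
    none of the excluded words, and (optionally) a given file type."""
    hits = []
    for doc_id, doc in docs.items():
        if file_type and doc["type"] != file_type:
            continue
        if required <= doc["words"] and not (excluded & doc["words"]):
            hits.append(doc_id)
    return hits

# "enterprise AND search NOT consumer, filetype:pdf" -> [1]
print(boolean_search({"enterprise", "search"}, {"consumer"}, file_type="pdf"))
```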

The enterprise-search business is quite different. Rather than broad but shallow appeal and advertising funding, enterprise search vendors have to answer needs that are simultaneously narrower and more functionally varied to entice the companies in their target markets to pay for software and support.

The net result is that enterprise search tools must negotiate access to content stores, apply techniques that classify and organize results, and embed the search function behind the scenes in operational applications. These needs imply an understanding of policy-based service provision, of identity, authentication, and authorization, and of the meanings of documents and their contents: their types and the concepts they cover. In contrast, "all" that consumer Web search has to do is to locate and index documents hosted by Web servers working over a far more extensive, distributed network of documents and servers than is contained within any one organization.
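The access-control piece is worth a sketch of its own. Nothing here reflects any particular vendor's design; the ACLs and group memberships are invented. The point is simply that an enterprise engine must intersect candidate hits with the requesting user's entitlements before returning anything:

```python
# Hypothetical sketch: filtering search hits by authorization before they
# are returned. The ACLs and group memberships are invented for illustration.
document_acls = {
    "q3-forecast.xls":   {"finance"},
    "press-release.doc": {"everyone"},
    "salary-bands.pdf":  {"hr"},
}
user_groups = {
    "alice": {"finance", "everyone"},
    "bob":   {"everyone"},
}

def authorized_hits(user, candidate_hits):
    """Keep only hits the user is entitled to see."""
    groups = user_groups.get(user, set())
    return [doc for doc in candidate_hits
            if document_acls.get(doc, set()) & groups]

hits = ["q3-forecast.xls", "press-release.doc", "salary-bands.pdf"]
print(authorized_hits("alice", hits))  # ['q3-forecast.xls', 'press-release.doc']
print(authorized_hits("bob", hits))    # ['press-release.doc']
```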

Roots and Branches

Enterprise search predates the Web and subsumes the keyword-based approach also applied by the big-name Web-search engines. There's more to enterprise search than just finding terms in documents — and much more to be searched than just document stores — but it's a start. The standard approach is to create document lists and a list of "postings" of words to documents. (XML creator and search expert Tim Bray has published a helpful set of essays, "On Search, the Series," on his ongoing weblog; see Resources.) Search engines don't look only for exact word and expression matches. They commonly "stem" words by removing variations in form such as suffixes, the idea being that two words that share a root — "dictionary" and "diction," for example — are likely related. They'll also apply a thesaurus to search on synonyms of the submitted search words.

In replacing exact searches with "fuzzy" interpretations, these standard techniques may introduce errors, as illustrated by the two words chosen as a stemming example. You'd welcome an engine's broadening a search on "dictionary" to include "lexicon" but not, say, "elocution." Search engines build in syntactic and semantic rules to improve their accuracy. They attempt to enhance their algorithms with a dose of rule-based reasoning, drawn from artificial intelligence (AI).
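A toy Python sketch ties the last two paragraphs together: a postings list built over a three-document collection, a crude suffix-stripping stemmer, and a one-entry thesaurus, all invented for illustration. A search on "dictionary" reaches the document containing "diction" (via stemming) and the one containing "lexicon" (via the thesaurus), while leaving "elocution" alone.

```python
from collections import defaultdict

# Rough sketch of a postings list with crude stemming and synonym expansion.
# The stemmer and thesaurus below are toy stand-ins, not production algorithms.
def stem(word):
    for suffix in ("ary", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

thesaurus = {"dictionary": ["lexicon"]}

documents = {
    1: "the dictionary improves diction",
    2: "a lexicon of search terms",
    3: "elocution lessons for managers",
}

postings = defaultdict(set)          # stemmed term -> set of document IDs
for doc_id, text in documents.items():
    for word in text.lower().split():
        postings[stem(word)].add(doc_id)

def search(word):
    terms = [word] + thesaurus.get(word, [])   # expand with synonyms
    hits = set()
    for term in terms:
        hits |= postings.get(stem(term), set())
    return sorted(hits)

print(search("dictionary"))   # [1, 2]: stemming matches "diction", thesaurus adds "lexicon"
```

Production stemmers (Porter's algorithm, for instance) and real thesauri are far more careful, but the fuzziness they introduce, and the risk of error, are of exactly this kind.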

Exact and extended matches aren't sufficient because natural language is chaotic. Natural language is irregular and lacking in canonical forms of expression — think of "there's more than one way to skin a cat." Even worse, natural language is full of slang. And of course there are dozens of major languages and dialects of interest as well as vast amounts of nontextual information.

These issues led Bray to write the following in "On Search, the Series": "Intelligent search is among the hardest of the hard AI problems. So, don't expect to be buying software that does it by this time next year.... If we want better search (and we do), we'd better not count on AI voodoo or linguistic juju or semantic mojo. We need to work with good, sound statistical techniques, and be clever about generating and using metadata, and we need to get our [application programming interfaces] right. All of these things are hard, and there is good work being done in all of them."

Metadata and Categorization

Bray says that intelligent search is hard and that one alternative is to boost the effectiveness of less-than-perfect search technologies by improving the ability to search documents and document sets. Information is much easier to find when individual objects are thoroughly described with standard metadata. And document sets are easier to search when their contents are classified and grouped into meaningful categories that capture links, common characteristics, and other forms of relationships. The goal is the same as when you model and store numerical data in a relational database. Prabhakar Raghavan, vice president and CTO of search vendor Verity, says that the "bulk of information value is perceived as coming from data in relational tables. The reason is that data that is structured is easy to mine and analyze" ... and easy to search.

Basic metadata describes documents via fields such as title, author, and version history. This information is most frequently expressed in formats such as HTML "metatags" and the fields you see under File | Properties in word-processing programs. As a byproduct, computerized systems record other metainformation, such as file name, owner, and creation date. All of this documentation is irregularly formed, and few authors willingly create much of it anyway. Systematic content management and accurate search require much more than this limited metainformation to work properly. Standardization efforts notably include the Dublin Core Metadata Initiative and the Object Management Group's metainformation framework. These efforts will ease system integration and interoperability, but you still have to populate all these structures.
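For a sense of how thin this layer is, here's what a Dublin Core-style record for an article like this one might look like, sketched as a Python dictionary and emitted as HTML metatags. The values are illustrative, and only a handful of Dublin Core's fifteen elements are shown.

```python
# A Dublin Core-style metadata record, sketched as a Python dictionary.
# Field names follow the Dublin Core element set; the values are illustrative.
record = {
    "title":    "Consumer and Enterprise Search: Not an Exact Match",
    "creator":  "Seth Grimes",
    "date":     "2004-05-18",
    "type":     "Text",
    "format":   "text/html",
    "language": "en",
    "subject":  ["enterprise search", "text mining", "metadata"],
}

# Emit the record as HTML metatags, e.g.
# <meta name="DC.title" content="Consumer and Enterprise Search: ...">
for element, value in record.items():
    values = value if isinstance(value, list) else [value]
    for v in values:
        print(f'<meta name="DC.{element}" content="{v}">')
```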

You can create metainformation in a number of ways: manually; by building metadata capture into automated business processes; and automatically, by inferring and extracting values after the fact. Manual documentation — that is, librarianship — is slow and expensive, and often relies on domain (semantic) knowledge and linguistic competence. Better to capture and normalize the metainformation that's generated in the course of business operations: This is the goal of a number of vendors. Verity calls the overall task "intellectual capital management." CTO Raghavan says that "the onus is on enterprise software companies to integrate content management" into their suites. To this end, Verity has established numerous third-party OEM relationships.

Kris Marubio, marketing vice president at Autonomy, characterizes her company's products as "infrastructure solutions" that mediate among content management systems, user-facing interfaces such as portals, and operational applications for CRM and the like. The company is pursuing parallel drives focusing on OEMs and on specific application areas, according to her colleague, Suranga Chandratillake, Autonomy director of technology. These application areas include call centers, where the object is to enable representatives to conduct real-time searches in the course of calls and managers to later search call records, and compliance, where the goal is to fulfill corporate needs to reach into diverse communications — email, telephone, and instant messages — and apply policies for information use and reporting.

Convera is a third vendor pursuing an OEM/applications strategy, also targeting compliance, customer self-service, and the like, with built-in cognizance of business rules and workflow and the ability to handle disparate data from a wide variety of sources.

Statistical Contributions

Statistical processing complements other approaches, bringing to bear mathematical techniques that discern patterns, infer relationships, and enable accurate prediction. It can compensate for inadequate manual and automated metadata collection and surmount the narrowness inadvertently imposed by automated business processes, which limit the metadata framework to whatever's necessary for a given process, accompanied by preconceived notions of what information is important. Statistics developed to describe worlds full of variation, incomplete knowledge, uncertainty, and error, which are precisely the worlds we wish and need to search.

The application of statistical techniques to extract concepts, discern relationships, and classify unstructured documents in categories is called text mining, a cousin to data mining, which operates on semistructured numerical data for similar ends. Where data mining can determine the principal components of a data model — such as the important dimensions of a data space — text mining can analogously generate taxonomies and hierarchical representations via categories of what might be termed a "knowledge" space. Taxonomies date back thousands of years; the most famous computing-era example isn't an automatically generated taxonomy but rather Yahoo's manually generated portal. But whether compiled by librarians and subject-matter experts or by computer programs, the goal is to enhance the search of a document set.
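None of these vendors publishes its algorithms, but the statistical machinery at the bottom of text mining can be suggested with a bare-bones Python sketch: weight terms by TF-IDF and compare documents by cosine similarity, so that the two finance memos in an invented three-memo collection score as more alike than either does with the operations memo.

```python
import math
from collections import Counter

# Bare-bones TF-IDF weighting plus cosine similarity, for illustration only.
docs = {
    "memo1": "quarterly revenue forecast and revenue growth",
    "memo2": "server outage incident report",
    "memo3": "revenue targets for the sales team",
}

def tf_idf_vectors(corpus):
    """Return a TF-IDF weight vector (term -> weight) for each document."""
    doc_terms = {name: Counter(text.lower().split()) for name, text in corpus.items()}
    n_docs = len(corpus)
    df = Counter()                      # document frequency of each term
    for terms in doc_terms.values():
        df.update(set(terms))
    return {name: {t: freq * math.log(n_docs / df[t]) for t, freq in terms.items()}
            for name, terms in doc_terms.items()}

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

vectors = tf_idf_vectors(docs)
print(cosine(vectors["memo1"], vectors["memo3"]))  # the two finance memos score higher...
print(cosine(vectors["memo1"], vectors["memo2"]))  # ...than finance vs. operations (0.0 here)
```

Clustering or classification then groups documents whose vectors lie close together, which is one simplified path from raw text to categories.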

The three vendors I've cited as well as others perform taxonomy-based text mining and statistical classification. Enterprise adoption of these technologies has a corollary benefit that's independent of any given software vendor. According to Dale Hazel, Convera senior vice president for marketing, enterprises "should have good knowledge architecture as a way of organizing information.... The organizational phase of solution deployment is [often] neglected." Information collection and use do not align with business processes and goals, Hazel says, "which is why taxonomies are important."

Descriptive metainformation and statistical pattern recognition and classification may be applied to render nontextual objects searchable, which is good because a very high proportion of business information is nontextual. This information consists of speech and other nonverbal audio, images, and video, and intangibles such as relationships. Nexidia provides an audio solution — a "phonetic" search engine — that the company sells directly and licenses for integration with third-party applications.
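Nexidia's phoneme-based technology is proprietary, but the underlying idea, matching on sound rather than spelling, can be illustrated with the far older and far cruder Soundex code. The Python below is a simplified Soundex, offered only as an analogue, not as a description of how Nexidia works.

```python
# A simplified Soundex implementation: illustrates matching on sound rather
# than spelling. (Phonetic search engines such as Nexidia's work on phoneme
# lattices, not Soundex codes; this is only a toy analogue.)
CODES = {**{c: "1" for c in "bfpv"},
         **{c: "2" for c in "cgjkqsxz"},
         **{c: "3" for c in "dt"},
         **{c: "4" for c in "l"},
         **{c: "5" for c in "mn"},
         **{c: "6" for c in "r"}}

def soundex(word):
    word = word.lower()
    first = word[0].upper()
    digits = [CODES.get(c, "") for c in word]   # vowels, h, w, y get no digit
    encoded = []
    for i, d in enumerate(digits):
        if d and (i == 0 or d != digits[i - 1]):   # collapse adjacent duplicates
            encoded.append(d)
    if digits[0]:                                  # the first letter is kept, not re-counted
        encoded = encoded[1:]
    return (first + "".join(encoded) + "000")[:4]

print(soundex("Smith"), soundex("Smyth"))   # S530 S530: different spellings, same code
```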

Vendors such as LinkedIn and Spoke tout the business value of social networks that consist of employees' professional and personal contacts. Tools from these and other emerging software developers demonstrate that organizations can model social networks and systematically exploit them via network search functions that detect and transmit requests using connections within and between companies. The varying approaches mirror those in the text-mining world. LinkedIn users' networks are purely intentional, built link by link, and have correspondingly high quality, while Spoke throws technology at the problem by mining email to infer relationships and their strength, leading to broader but lower-quality networks.
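The "search" in social-network search is largely graph traversal. A hypothetical sketch, over an invented contact graph: breadth-first search finds the shortest chain of introductions from a user to a prospect.

```python
from collections import deque

# Hypothetical sketch: finding the shortest chain of introductions in a
# contact graph with breadth-first search. The network itself is invented.
contacts = {
    "you":      ["ana", "bill"],
    "ana":      ["you", "carla"],
    "bill":     ["you", "dmitri"],
    "carla":    ["ana", "prospect"],
    "dmitri":   ["bill"],
    "prospect": ["carla"],
}

def introduction_path(start, target):
    """Return the shortest chain of contacts linking start to target."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for friend in contacts.get(path[-1], []):
            if friend not in seen:
                seen.add(friend)
                queue.append(path + [friend])
    return None

print(introduction_path("you", "prospect"))  # ['you', 'ana', 'carla', 'prospect']
```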

Internet-mediated social networks are by definition restricted to participants with Internet access; no search technology can work with documents (or at least metadata describing documents) that aren't in electronic form. Enterprises recognize this inherent limitation better than consumer Web users, who often complacently ignore any information source that isn't online.

Enterprises are driven by mandates that include boosting profitability and competitiveness, providing a single, integrated view of customer information, meeting corporate compliance requirements, and preventing terrorist attacks. While Google and Yahoo share significant technology underpinnings with the enterprise search vendors that aim to help organizations meet goals of this demanding nature, corporate security, accuracy, integration, and interoperability demands will keep enterprise and Web search implementations on separate but parallel tracks for the foreseeable future.

Seth Grimes is a principal of Alta Plana Corp., a Washington, D.C.-based consultancy specializing in business analytics and demographic, marketing, and economic statistics.

Resources

Autonomy: autonomy.com

Convera: convera.com

Dublin Core Metadata Initiative: dublincore.org

"On Search, the Series": www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC

Nexidia: nexidia.com

Verity: verity.com

Online at IntelligentEnterprise.com:

"Matchmaker, Matchmaker," April 17, 2004

"The Word on Text Mining," Dec. 10, 2003
