ReviewCam: OpenCalais -- "Semantic Plumbing"
When we look back, years from now, this decade's Internet Search may seem prehistoric. Perhaps that's why so many companies (including the usual suspects) have begun working toward an enlightened future. Thomson Reuters is one such company, and its OpenCalais project provides what the company likes to call semantic plumbing; you can see what that means in our ReviewCam.
September 9, 2009
For a deeper look, view the video below. It includes some result sets from a simple testing tool, and a few examples from Thomson Reuters customers.
While it seems like we've been successfully using search for a long time, there is much work to be done in getting to the right results more quickly. Nearly every user has silently hurled an angry command at a badly returned results page: Don't find what I typed, find what I meant! Behind Microsoft Bing's well-conceived user interface is a better result; behind the high-horsepower drivetrain Google is purported to be putting into its Caffeine project is a better result. Companies like Blinkx and AOL (Truveo) want to help you find the right video. You can search images and tweets. There's more, and then there's better.

A few months ago, I profiled the so-called semantic search company Truevert, which relied not on a better ontology, but on the body of work that end users actually do when searching for data. Here's the key paragraph from that blog:

"A true semantic-based approach trusts a context, rather than a categorization. OrcaTec started Truevert with a more vertical approach, namely 'green.' So everything gets searched through that filter. It uses Yahoo BOSS to gather a Web search, but it then re-ranks the results based on its own language model derived from understanding the association and context of words from 6,000 green-tagged documents in Delicious (which it can do on a mere laptop in less than 15 minutes). Google's terms of service, Roitblat says, don't allow re-ranking of pages the way Truevert does it."

One mistake I made when I wrote that piece was calling Thomson Reuters' OpenCalais a competitor. Instead, Calais is a web service. A publisher, a developer, or a site can submit its information to the open Calais service, and it will work its magic behind the scenes to provide a more contextually relevant result set to search engines. Thomson Reuters acquired the technology a couple of years ago from ClearForest, and its goal is natural language processing.

When I talked with the company months ago, it explained that most natural language processing efforts are based on RDF (Resource Description Framework), a much more structured relative of XML, the mission of which is to publish data rather than web pages. There's a query language for RDF called SPARQL, and an ontology to describe the data. Using the Calais web services API, you can take unstructured text, run it through the natural language engine, and it spits out information in RDF; both steps are sketched in the code examples at the end of this section. In theory, then, you get better information and you get it extremely fast.

Lots of other technology is needed to take this further, like the linked data standard Tim Berners-Lee is working on -- essentially the building block for creating linkages among all of this richer metadata. An example of this kind of work is DBpedia, an effort to create links between information on the Web and Wikipedia content.

With OpenCalais and other similar technologies, you can envision publishers (or anyone) being able to create very specific information streams on unique web pages, or in widgets on existing pages (or tag clouds), based on a very well-crafted set of criteria; that is, not just finding information when you know you need it, but finding relationships between information in a predictable way to yield unpredictable results -- better, fresher, surprising results. And doing it automatically, and, as Thomson Reuters likes to point out, in ways that mirror how humans think. Clearly, then, it's in Thomson Reuters' interest to give this natural language engine away.
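To make that flow concrete, here is a minimal Python sketch of submitting unstructured text to the Calais service. The endpoint URL, parameter names, and license-key placeholder are assumptions based on the publicly documented REST interface of the time, not a verbatim recipe; consult the OpenCalais documentation for the real details.

import requests

# Illustrative endpoint and parameters -- check the OpenCalais docs
# for the actual REST interface before relying on these.
CALAIS_URL = "http://api.opencalais.com/enlighten/rest/"
LICENSE_ID = "your-calais-license-id"  # hypothetical placeholder key

text = ("Thomson Reuters acquired ClearForest a couple of years ago "
        "and turned its technology into the OpenCalais web service.")

response = requests.post(
    CALAIS_URL,
    data={
        "licenseID": LICENSE_ID,
        "content": text,
        "paramsXML": "",  # optional processing directives; RDF is the default output
    },
    timeout=30,
)
response.raise_for_status()

# The body is RDF describing the entities, facts, and events Calais found.
print(response.text)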
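And here is a second sketch showing what you might do with the RDF that comes back: load it into a graph and run a SPARQL query over it, using the open source rdflib library for Python. The file name is a placeholder; the query simply lists each resource alongside its type, which in Calais output corresponds to the detected entity categories.

from rdflib import Graph

# Load RDF saved from the previous call; the file name is a placeholder.
g = Graph()
g.parse("calais_output.rdf", format="xml")

# SPARQL: list every resource and its rdf:type ("a" is shorthand for rdf:type).
query = """
    SELECT ?entity ?type
    WHERE { ?entity a ?type . }
"""

for entity, entity_type in g.query(query):
    print(entity, entity_type)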
The more people who use the engine, the richer Thomson Reuters can make its own information, and the more frequently its massive information database gets into the hands of more people (or onto more sites). What it loses is control over how that information is presented and how it's linked to other data -- on its own sites it can manage the data, the flow, and the presentation, but that's also a very manual and insular process. The company noted that the amount and type of content is exploding (user-generated content, Twitter, etc.), and that "we can act like AOL and pretend it's not happening, or acknowledge that it is and embrace it through interoperability." Being a trusted source of information against which to bounce all of that "wild content" is where Thomson Reuters wins. "Hedge traders want to pay attention to Twitter and blogs, but they need to bounce that against content they can trust."

Fritz Nelson is an Executive Editor at InformationWeek and the Executive Producer of TechWebTV. Fritz writes about startups and established companies alike, but likes to exploit multiple forms of media in his writing.