Batten Down The Hatches -- Here Comes Google's OCRopus!

Google's new open-source OCR software will transform online search technology. Almost certainly, it will also catch some companies with their virtual pants down.

Matthew McKenzie, Contributor

November 5, 2008


Scanned PDFs have long befuddled Google's online indexing system. Although text-based PDF documents created with Acrobat Distiller or similar products are relatively accessible, those stored as scanned images are a very different story. Here is how Google employee Evin Levey describes the dilemma in a recent blog post:

"While we've indexed documents saved as PDFs for some time now, scanned documents are a lot more difficult for a computer to read. Scanning is the reverse of printing. Printing turns digital words into text on paper, while scanning makes a digital picture of the physical paper (and text) so you can store and view it on a computer. The scanned picture of the text is not quite the same as the original digital words, however -- it is a picture of the printed words. Often you can see telltale signs: the ring of a coffee cup, ink smudges, or even fold creases in the pages."

For a couple of years now, Google has been working on a solution to this problem. The company's open-source OCRopus project has steadily improved to the point where Google is now ready to launch it officially. As a result, huge numbers of scanned PDF documents published online will soon be far more open and accessible to Google's standard indexing tools.

From a Web-search user's point of view, OCRopus promises a vast improvement in both the quantity and quality of search results. Many online publishers, however, including quite a few businesses, are likely to find themselves scrambling to adjust to this new paradigm.

The problem boils down to a familiar mistake: relying on security through obscurity. Many companies (and certainly many individuals) see scanned PDF files as a relatively safe way to publish information that isn't sensitive enough to keep completely offline, yet is sensitive enough to keep out of standard Web-search results. Although Google could glean some information about scanned PDF documents indirectly, such as through analysis of other sites linking to those documents, the company's indexing tools could extract very little information from the documents themselves -- that is, until now.

(There are also, in theory, ways to keep documents out of search results using a properly configured "robots.txt" file on a Web server. Although Google's indexing software will honor such requests, not every search-and-index robot crawling the Web is so well-mannered.)
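For readers who want to see what a "robots.txt" rule does in practice, here is a minimal sketch in Python. It uses the standard library's urllib.robotparser to check whether a well-behaved crawler would be allowed to fetch a given URL. The "/scans/" directory, the example.com addresses, and the helper name is_crawl_allowed are assumptions made up for illustration; they do not come from Google or from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. The /scans/ directory is an assumption
# for illustration: a folder where scanned PDFs might be stored.
ROBOTS_TXT = """\
User-agent: *
Disallow: /scans/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URLs on a placeholder domain.
    for url in (
        "https://www.example.com/scans/contract.pdf",
        "https://www.example.com/reports/summary.pdf",
    ):
        verdict = "allowed" if is_crawl_allowed(ROBOTS_TXT, "Googlebot", url) else "blocked"
        print(url, "->", verdict)
```

Keep in mind that this only models what a polite crawler would do. The rule is a request, not an access control, so a robot that ignores "robots.txt" can still download anything the Web server is willing to serve.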

It isn't clear how much Web real estate OCRopus has crawled at this point. It is a safe bet, however, that if OCRopus hasn't yet passed through your online neighborhood, it probably will before too long. If your company's Web site contains scanned PDF documents best kept out of Google's standard Web-search results, now is a good time to identify and secure them.
