Hadoop Crunches Web-Sized DataHadoop Crunches Web-Sized Data

To parse large volumes of data from Web servers, Yahoo and others turn to Hadoop's open source cloud-based analysis system.

Charles Babcock, Editor at Large, Cloud

November 6, 2009

4 Min Read
information logo in a gray background | information

As the World Wide Web has exploded into millions of sites and billions of documents, the search engines that purport to know about everything on the Web have faced a gargantuan task. Sure, more spiders can be activated to crawl the Web and collect information. But what system can analyze all the data before the information is out of date?

The answer is a cluster-based analysis system, sometimes referred to loosely as a cloud database system. At the Cloud Computing Conference and Expo Nov. 3 in Santa Clara, Calif., representatives of Yahoo explained how they use Hadoop open source software, from the Apache Software Foundation, to analyze the Web.

Hadoop is a system that can be applied to "big data," or masses of data collected from the Web, such as the crawls that lead to the search indexes. Eric Baldeschwieler, VP of Hadoop software development, leads the largest known active Hadoop development team and said Yahoo is the world's largest Hadoop user. It uses Hadoop on clusters of 4,000 computers to analyze up to 92 petabytes of data stored on disks.

Hadoop builds Yahoo's indexes of the Web that power the Yahoo search engine. Its Web mapping system "runs in 73 hours, taking as input, data from all the Web pages in the world," he said. Yahoo's digest of Web pages consists of 300 terabytes of data. Hadoop analysis tells Yahoo's ad system what ads to serve to visitors, based on their profile from searches they've conducted on the site.

It's use of Hadoop keeps it running on a total of 25,000 servers at the company, he said. Yahoo distributes its tested, production version of Hadoop for free, Baldeschweiler said.

Another speaker at the conference was Christophe Brisciglia, a former Google engineer and now part of the founding team at Cloudera, a firm that is producing a supported enterprise distribution of Hadoop. "Cloudera is to Hadoop as Red Hat is to Linux," he said.

Brisciglia described Hadoop as "a batch data processing system" for use on clusters of commodity hardware. Unlike relational database, "in Hadoop there is no structure (to the data). You can dump incredibly large amounts of data into a Hadoop cluster and figure out what to do with it later." A major piece of Hadoop is the MapReduce function developed at Google as a way to speed response time to search queries. MapReduce lifts data off of disks in parallel, taking the speed of one drive and multiplying it by the one thousand or however many drives are being used in a Hadoop analysis. A mapper function correlates the location of the data to the nearest processor in the cluster and directs it there, where the reduce function, either sorting, filtering or analyzing the data in some simple way, occurs.

Unlike a relational database, which goes to a table or small set of tables for a limited amount of data, all or most of the data in a given data set, such as a Web crawl, will be unloaded from disk and analyzed at one time in a MapReduce exercise.

Hadoop relies on the Hadoop Distributed File System, which distributes data across a distributed disk drive system in large chunks. Most file systems break files into small blocks of a few kilobytes for storage purposes. The HDFS subdivides files in 64 or 128 MB chunks, minimizing the time a disk drive head will spend seeking a blocks of data.

In this way, Hadoop has gone from experimental software to the winner of the Jim Gray prize for the fastest sort of a terabyte of data. Gray was a notable Microsoft database researcher and Turing award recipient who was lost at sea in 2007.

In 2008, Hadoop required 209 seconds to analyze the terabyte, using a 900 node cluster. In 2009, it won the prize for an analysis that took only 62 seconds, running on 1,500 nodes, said Yahoo's Baldeschwieler.

The name Hadoop is whimsical, according to Tom White, an original Hadoop developer and author of Hadoop, The Definitive Guide. He cites fellow developer Doug Cutting as saying it was meant to be "short, easy to spell and pronounce, and meaningless."


information has published an in-depth report on new software models. Download the report here (registration required).

Read more about:

20092009

About the Author

Charles Babcock

Editor at Large, Cloud

Charles Babcock is an editor-at-large for information and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive Week. He is a graduate of Syracuse University where he obtained a bachelor's degree in journalism. He joined the publication in 2003.

Never Miss a Beat: Get a snapshot of the issues affecting the IT industry straight to your inbox.

You May Also Like


More Insights