New Technology Seeks To Let Startups Build Their Own Googles
Open source search projects such as Hadoop, Lucene, and Nutch, combined with affordable, on-demand computing through Amazon Web Services, are putting scalable search infrastructure within the reach of most startups.
One of the first questions online startups typically face these days from potential investors is "Why couldn't Google build this?" Entrepreneurs are beginning to respond, "Why couldn't we build Google?"
The slow but steady maturation of open source search projects like Hadoop, Lucene, and Nutch, combined with the availability of affordable, on-demand computing through Amazon Web Services, suggests that scalable search infrastructure is well within the reach of most startups.
Hadoop is a framework for running applications on clusters of commodity hardware that duplicates the functions of the distributed Google File System and Google's MapReduce algorithm for processing large data sets. Lucene is a Java-based search and indexing system. Nutch expands on Lucene by adding Web-based crawling and additional search capabilities.
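To make the division of labor concrete, here is a minimal single-machine sketch of the MapReduce pattern applied to the kind of inverted index a search engine relies on. It is an illustration only: it uses neither Hadoop's nor Lucene's actual APIs, and every function name in it (map_phase, shuffle, reduce_phase, build_index, search) is a hypothetical chosen for this example. A real cluster would spread the map and reduce steps across many machines.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct word in a document."""
    return [(term, doc_id) for term in set(text.lower().split())]


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(term, doc_ids):
    """Reduce step: collapse all ids for one term into a sorted posting list."""
    return term, sorted(doc_ids)


def build_index(docs):
    """Run map, shuffle, and reduce in sequence over a dict of {doc_id: text}."""
    pairs = []
    for doc_id, text in docs.items():
        pairs.extend(map_phase(doc_id, text))
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


if __name__ == "__main__":
    docs = {
        1: "open source search",
        2: "search engine for code",
        3: "open code",
    }
    index = build_index(docs)
    print(search(index, "open code"))
```

Because each document is mapped independently and each term is reduced independently, both steps can in principle be split across as many machines as are available, which is the property that makes the approach attractive on commodity hardware.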
These open source search projects already are in use at companies and organizations such as Krugle, Powerset, Wikipedia, and Zimbra.
Krugle, a search engine for programmers that helps users find code and technical information online, is built on Nutch and Lucene. "It would have been impossible for us to create the capability that we have and go live in the speed that we did without Nutch and Lucene," says Krugle CEO Steve Larsen. "They were extremely important to us being able to solve the technical problems that we did in a short amount of time."
Access to the code also was important, says CTO Ken Krugler, "so we had the flexibility for the things that we needed for a vertical solution. The commercial solutions are much more restrictive. It's harder to tweak it and form it to what you need."
Krugle maintains about 100 servers at a colocation facility. Krugler says Amazon's Elastic Compute Cloud looks promising, but he sees it as better suited to companies that are just getting started. The cloud, also referred to as EC2, is simply virtual processing power that can be paid for as needed.
"It scales better than doing a co-host setup," says Krugler, though he still considers it too new to rely on. "Technically it ought to scale, but you just don't know."
Search startup Powerset is using EC2 to power its forthcoming natural language search site, apparently without any such reservations. In announcing Powerset's use of EC2 at the Web 2.0 Summit earlier this month, founder and CEO Barney Pell said his company's use of Amazon's technology "represents an important shift in the competitive dynamics within the search industry" because Powerset doesn't have to put up the capital "to build out a data center big enough to scour the entire Web and serve queries for millions of users" to compete with Google and Yahoo.
Pell neglected to mention that his company also is using Hadoop to cache search results before storing them to its local network. In an e-mail sent to the Hadoop developer mailing list, Powerset CTO Lorenzo Thione describes how Hadoop and EC2 can be used in a fault-tolerant search system. "A nice feature of Hadoop as measured against our use of EC2 has been the capability of fluidly changing the number of instances that are part of the cluster," wrote Thione. "Our instances are set up to join the cluster and the [Hadoop Distributed File System] as soon as they are activated and when -- for any reason -- we lose those machines, the overall process doesn't suffer."
Of course, there's a lot more to Google than search infrastructure. Even if rivals reach some measure of technological parity, Google will still have a formidable user base and brand, barring some AOL-style data disaster. And that's to say nothing of making search work as a business. At the moment, there's no open source ad platform to rival what Google, Microsoft, and Yahoo have built, not to mention Amazon and eBay.
But as open source projects get used more frequently in commercially successful projects, the companies using that software drive its development. Krugle's Larsen says his company has helped drive the development of Nutch and notes that Yahoo continues to push Hadoop forward. That kind of work will end up giving future startups even more of a leg up.