IT At The Edge Of Science
Bioresearch firms are pushing the performance boundaries of supercomputers but grappling with scalability and management issues
Early this year, the genome-cracking life-science company Celera Genomics Group, the Department of Energy's Sandia National Laboratories, and Compaq joined forces on a four-year, $40 million project to develop a supercomputer for genetics research capable of processing 100 trillion operations per second, eight times as many calculations as the world's largest supercomputer cluster can handle today. The agreement heralded the arrival of genomics research as a new driver of high-end computing. "We're writing the first draft of the future here," said Neal Lane, President Clinton's science adviser, at a Jan. 19 press conference in Washington to announce the partnership.
Biological discovery, propelled by Celera's mapping of the human genetic sequence six months earlier, is outrunning nuclear physics in the complexity of its problems, Sandia president Paul Robinson said at the same conference.
Life-science researchers are pushing the performance boundaries of supercomputers, which comprise hundreds of processors and terabytes of disk space and cost millions of dollars, as they analyze the billions of letters of the genetic code and simulate molecular reactions among cellular proteins in a quest to improve human health. Supercomputers, once chiefly the province of government labs and universities, have become integral to this research. But IT executives at bioresearch firms say vendors are lagging far behind the accelerating need for sophisticated systems management, clustering for symmetric multiprocessing, and data integration and analysis tools.
"Bio-informatics has moved from almost nothing to something, almost overnight," says Marshall Peterson, VP of infrastructure technology at Celera. "But the tools aren't designed for the high throughput and volume of data that's now being generated."
That's cause for concern at as-yet-unprofitable companies such as Celera and Myriad Genetics Inc. Wall Street has signaled that high valuations for companies compiling and selling genetic data may continue only if they start to show profits by entering new drug markets. High on the agenda is the development of medical therapies, customized to an individual's genetic code, for potentially fatal diseases. Applera Corp., Celera's parent company, has already extended Celera's scope: Applera plans to spend about $75 million to transform its $42.7 million brainchild into a full-scale drug discovery company and expand Celera's molecular diagnostics and gene-research businesses. "Celera has entered the race against diseases like cancer," says chairman J. Craig Venter. "We're going to push way beyond the information model."
To meet those goals, bioresearch firms want vendors to deliver more tools that can help cut the time their IT shops spend lashing together distributed-processing systems, integrating data, and managing computing resources, leaving more time for hard-core research. "Vendors rarely put this many systems and this much data together," Peterson says. "We end up doing a lot of the systems-engineering work that Compaq, IBM, Sun, and Hewlett-Packard ought to be doing for us."
The problem stems from the fact that many startup biotech companies, driven by commercial deadlines, competitive pressures, and accountability for quarter-to-quarter financial performance, rely on massively parallel systems that can be built quickly from off-the-shelf processors and memory chips. Those systems break down jobs into pieces that hundreds or thousands of independent CPUs handle, then re-collect the computations to formulate an answer. They're well-suited to biology applications that rely on searching for matching patterns of integers within large databases. And the loosely coupled approach, even priced at tens of millions of dollars, is cheap compared with highly engineered supercomputers whose components and software tools are specially designed by IT vendors to work together seamlessly. The trade-off: Scalability and management tools available for highly engineered supercomputers, or even mainframes, are in short supply for loosely coupled systems, particularly for the many biotech firms that prefer open-source software and are attempting to scale Linux on clusters larger than ever before.
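The divide-and-recollect idea is simple to picture. Below is a minimal sketch, in Python, of splitting a pattern search over chunks of a sequence database and gathering the partial results; the database, query, and worker count are invented for illustration, and real sequence-search codes are far more sophisticated.

```python
# Illustrative sketch of the split-search-recollect pattern described above.
# The sequence database and query are made up; real bioinformatics searches
# (e.g., alignment algorithms) are far more involved.
from multiprocessing import Pool

SEQUENCE_DB = [
    ("seq1", "ACGTACGTGGTTAACCGT"),
    ("seq2", "TTGACCGTACGTACGTAA"),
    ("seq3", "GGGTTTAAACCCACGTAC"),
]

QUERY = "ACGTACGT"

def search_chunk(records):
    """Scan one chunk of the database for exact occurrences of QUERY."""
    hits = []
    for name, seq in records:
        pos = seq.find(QUERY)
        if pos != -1:
            hits.append((name, pos))
    return hits

def split(records, n):
    """Break the database into n roughly equal pieces."""
    return [records[i::n] for i in range(n)]

if __name__ == "__main__":
    with Pool(processes=4) as pool:          # stand-in for hundreds of CPUs
        partial = pool.map(search_chunk, split(SEQUENCE_DB, 4))
    # Re-collect the partial results into a single answer.
    results = [hit for chunk in partial for hit in chunk]
    print(results)   # [('seq1', 0), ('seq2', 8)]
```

In a production cluster each chunk would land on its own node rather than a local worker process, but the shape of the computation is the same.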
Celera's Rockville, Md., data center brims with a supercomputer cluster of 400 Compaq Alpha systems connected to 100 terabytes of disk space. The company runs 150,000 jobs each week, including a biweekly refresh of the subscription genomics database Celera sells to pharmaceutical companies. It's a delicate balance to make sure the cluster, with its expensive store of processors, disks, memory, and network equipment, works close to capacity to maximize Celera's return on investment while ensuring key jobs run on time.
That's not always easy, given the lack of sophisticated job-scheduling and accounting tools to push important jobs to the head of the queue. "I have the same pressures as the poor guy who runs Citibank," Peterson says. To help wring better performance out of its cluster, Celera last year spent $283 million to buy Paracel Inc., a maker of massively parallel computers with expertise in bioinformatics whose technology will let Celera run commonly used algorithms in hardware instead of software.
Until recently, performance requirements for supercomputers were set by physicists working in government weapons-research labs. Lawrence Livermore National Laboratory, a nuclear-weapons lab in Livermore, Calif., houses the world's largest computer, according to Top500.org, a list of the largest supercomputer installations that's maintained by the University of Mannheim and the University of Tennessee. It's an IBM SP consisting of 512 machines and 8,192 processors, covering the area of two basketball courts, that can process 12.3 trillion floating-point operations per second (teraflops) and is used to simulate nuclear explosions. A Compaq cluster being built at the Los Alamos National Laboratory in New Mexico will be able to perform 30 teraflops when it's fully operational next year.
But as the market for proteomics (the study of the function, structure, and interactions of proteins in cells) blossoms, commercial-sector biologists are challenging national labs in the size and scope of their ambitions. Market research company International Data Corp. says global sales of supercomputers to biosciences companies will climb to $840 million by 2005, compared with $162 million last year, a compound annual growth rate of 39%. By contrast, IDC expects the overall global market for supercomputers, valued at $1.08 billion last year, to grow just 13.5% a year over the next four years.
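The projection works out to roughly 39% compound annual growth if the $162 million base is taken as 2000 and the target year as 2005 (an assumption; the article says only "last year"). A quick check:

```python
# Rough check of IDC's projected growth rate for bioscience supercomputer sales.
# Assumes the $162 million base year is 2000 and the target year is 2005.
base, target, years = 162e6, 840e6, 5
cagr = (target / base) ** (1 / years) - 1
print(f"{cagr:.0%}")   # ~39%, matching the figure cited above
```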
Celera and Sandia's planned Alpha cluster is only the most visible upstart. In December, NuTec Sciences Inc., an Atlanta company that uses supercomputers to perform research for the life-science and oil industries, ordered 1,250 clustered RISC-based p640 servers from IBM, running DB2 Universal Database on AIX and capable of crunching 7.5 trillion calculations per second. When complete later this year, the system will be among the world's 10 largest, according to the Top500 list, and the biggest outside a government agency.
"You're not talking about being at the forefront of science, you're talking about being at the edge of science," NuTec CEO Michael Keehan says. IBM, which invests in NuTec, will help the company build a data-mining and analysis system for the cluster that Emory University will use to devise cancer treatments tailored to patients' genetic profiles.
MDS Proteomics, an $817 million unit of Toronto health-care provider MDS Inc. that secured a $10 million investment from IBM, this year built a 600-CPU cluster of Pentium-based IBM servers running Linux and managed by IBM Unix servers. It's expected to crack the list of the 50 largest computers.
This month, the National Center for Supercomputing Applications at the University of Illinois, which maintains a 1,024-CPU Pentium cluster ranked 30th on the Top500 list, will bring online a new 320-processor Itanium cluster that Intel hopes will help demonstrate Itanium's value for molecular modeling and other biology applications. "Is life science advancing the state of scientific computing?" asks Rob Pennington, an NCSA associate director. "The answer is, undoubtedly, yes."
The human genome consists of about 3.12 billion base pairs of DNA, its chemical letters labeled A, C, G, and T; about 90% of human genetic variation can be accounted for by differences of just a single letter in a person's DNA sequence. But those small variations mean big differences in how drugs work. The genetic sequence also indicates the position of an estimated 30,000 genes, each a set of instructions for how to manufacture proteins, which carry out the body's molecular work. Scientists hope that by understanding how elusive proteins work, they can develop more-effective drugs and tests as well as gain a better understanding of how diseases form.
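As a toy illustration of how small those single-letter differences are, the sketch below compares two short, invented DNA fragments and reports where they disagree; real variant detection works on aligned sequencing data and is far more involved.

```python
# Toy example of spotting single-letter differences between two aligned DNA
# fragments. The fragments are invented for illustration.
reference = "ACGTTAGCCA"
sample    = "ACGTCAGCCA"

variants = [
    (i, ref, alt)
    for i, (ref, alt) in enumerate(zip(reference, sample))
    if ref != alt
]
print(variants)   # [(4, 'T', 'C')] -- a single letter differs at position 4
```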
The research hinges on quickly assembling highly scalable systems. But "there's still a big gap in the software that's been properly parallelized for clusters," says Chris Hogue, CIO at MDS Proteomics. MDS runs 300 dual-CPU Pentium III servers, 200 in Toronto and 100 in Denmark, to compare tissue samples with DNA databases in order to match proteins to the genes that made them. It hopes to sell the results to a large pharmaceutical company. MDS and IBM unveiled a nonprofit company in May that will create a public database of protein information about humans and other organisms.
For four years, Hogue has been refining homegrown cluster-management software called Moby Dick. It breaks up the calculations that describe how cells in the body behave when they're diseased and how they interact with certain drugs, so that MDS can distribute the pieces across the 600 CPUs managed by its IBM PowerPC servers.
Hogue says Moby Dick accounts for differences in the way 64-bit PowerPC chips and 32-bit Pentiums store operations in memory, and he figures the software can help life-science startups save money by making it easier for them to work with less-expensive hardware. "I can't realize the cost savings in my cluster unless I have software to run on it," he says.
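The article doesn't say exactly how Moby Dick reconciles the two architectures, but one classic issue when mixing big-endian 64-bit PowerPC machines with little-endian 32-bit Pentiums is byte order and numeric width. A minimal sketch of the usual remedy, fixing an explicit wire format so either side can read the other's data, appears below; it is illustrative, not Moby Dick's actual mechanism.

```python
# Sketch of normalizing numeric data for exchange between big-endian PowerPC
# and little-endian Pentium machines by fixing an explicit wire format.
# Illustrative only; not Moby Dick's actual mechanism.
import struct

def pack_scores(scores):
    """Pack a list of 64-bit floats in big-endian ('network-style') order."""
    return struct.pack(f">{len(scores)}d", *scores)

def unpack_scores(payload):
    """Unpack the same format regardless of the local machine's byte order."""
    count = len(payload) // 8
    return list(struct.unpack(f">{count}d", payload))

wire = pack_scores([0.5, 1.25, -3.0])
print(unpack_scores(wire))   # [0.5, 1.25, -3.0] on either architecture
```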
Gary Smaby, a supercomputer analyst and principal of venture-capital firm Quatris Fund, calls the low-cost clustering approach "Lego-block computing." But only a handful of vendors, such as Compaq, IBM, and Intel, have the financial wherewithal to spend research and development money on scaling off-the-shelf clusters for life-science applications, he says. And no single solution has emerged as a clear leader, leaving many IT departments to fend for themselves.
"There's a huge learning curve going on right now," says Rob Harrison, director of IS at Myriad Genetics, a $34 million drug company that markets screening tests for cancer and hypertension. In April, Myriad launched a $185 million joint venture with Oracle and Hitachi Ltd. The company, Myriad Proteomics, aims to catalog interactions between human proteins on a tight three-year deadline, trying to identify factors that contribute to disease and requiring Harrison's dozen IT staffers to scale its Oracle-on-Sun environment from less than 3 terabytes of data to as much as 50 terabytes.
Critical to the project's success, Harrison says, is making sure Myriad's escalating volume of data is structured properly, so it's accessible through the Web. Myriad is relying on its partners, Oracle and Hitachi, to help it chart those waters. "There are tools out there right now, but we don't know which ones are best to use," Harrison says.
Another IT challenge is managing life-science clusters, particularly Beowulf systems built by linking PCs running Linux or Unix via Ethernet. Steve Oberlin, formerly chief architect for massively parallel systems at Cray Research and now CEO of Unlimited Scale, a startup working on Linux scalability, says Beowulf clusters lack sophisticated management tools. These include tools to save snapshots of applications in progress so they can be restarted after an interruption; absorb new, high-priority jobs into a work queue while keeping CPU utilization above 90%; and assign CPU and disk-space limits to users in groups and charge capacity back to certain departments.
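To make the gap concrete, here is a toy sketch of two of those missing pieces: pushing high-priority work to the front of a queue and charging CPU time back to the group that submitted it. It is a few lines of illustrative Python, not a real scheduler, and checkpoint/restart is left out entirely.

```python
# Toy scheduler illustrating priority insertion and chargeback accounting.
import heapq
from collections import defaultdict

class MiniScheduler:
    def __init__(self):
        self._queue = []                     # (priority, sequence, job, group) min-heap
        self._seq = 0                        # tie-breaker preserves submission order
        self.usage = defaultdict(float)      # CPU-hours charged per group

    def submit(self, job_name, group, priority=10):
        """Lower priority number = runs sooner, so urgent jobs jump ahead."""
        heapq.heappush(self._queue, (priority, self._seq, job_name, group))
        self._seq += 1

    def run_next(self, cpu_hours):
        """Pop the most urgent job and charge its CPU time back to its group."""
        priority, _, job_name, group = heapq.heappop(self._queue)
        self.usage[group] += cpu_hours
        return job_name

sched = MiniScheduler()
sched.submit("weekly-db-refresh", group="genomics", priority=10)
sched.submit("urgent-assembly", group="oncology", priority=1)   # jumps the queue
print(sched.run_next(cpu_hours=128))   # urgent-assembly
print(dict(sched.usage))               # {'oncology': 128.0}
```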
Aware of the potential if they can deliver answers, Compaq, HP, IBM, Oracle, Sun Microsystems, and others are duking it out to win product sales, service contracts, and technology-sharing agreements with life-science companies. Oracle running on Alpha systems has been a mainstay in the biotech field for years, and Compaq led all vendors last year with a 37% share of the billion-dollar market for systems configured and purchased to solve the most demanding computer problems, according to IDC. IBM's share was 29%, while HP and Sun claimed 11% and 10%, respectively. Silicon Graphics Inc. had 7%.
IBM has earmarked $100 million through the end of next year to build a life-science business unit, and it may double that amount. It's also making equity investments in startups such as MDS to gain access to biological expertise. "This isn't something we're doing now because of the dot-com bust, and next year we'll do something else," says Carol Kovac, VP of IBM's life-science unit. "This is the way we think of E-business." Two weeks ago, the vendor said it will provide hardware and software to the British government to link high-powered computers used in scientific research to a national computing grid that will act as a single virtual supercomputer. Experts say computing grids that share remote databases across a network could aid genome research, which relies on mining large volumes of data from far-flung sources.
In an even higher-profile effort, IBM Research is assembling an experimental supercomputer called Blue Gene to simulate protein folding, a complex task involving tens of thousands of processors and software to coordinate them. Blue Gene is expected to perform more than 1 quadrillion floating-point operations per second once finished in 2003. Other projects are less experimental. DiscoveryLink middleware tailors IBM's DB2 Universal Database for life-science problems, letting users query disparate data sources as if they were one virtual database, without writing custom software, a capability useful at drug companies that house both chemical and biological data.
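The federated approach can be pictured as a single query layer that dispatches sub-queries to separate chemical and biological stores and joins the answers. The sketch below illustrates that idea with two in-memory SQLite databases and invented table names; it is conceptual and does not use DiscoveryLink's actual syntax or API.

```python
# Conceptual sketch of a federated query: one logical question answered by
# pulling from two separate stores and joining in a middleware layer.
# Table and field names are invented; this is not DiscoveryLink's API.
import sqlite3

chem = sqlite3.connect(":memory:")     # stands in for a chemical database
bio = sqlite3.connect(":memory:")      # stands in for a biological database

chem.execute("CREATE TABLE compounds (id TEXT, target_gene TEXT)")
chem.execute("INSERT INTO compounds VALUES ('CMP-17', 'BRCA1')")

bio.execute("CREATE TABLE genes (symbol TEXT, protein TEXT)")
bio.execute("INSERT INTO genes VALUES ('BRCA1', 'breast cancer type 1 protein')")

def federated_query(gene_symbol):
    """Answer one question by querying both sources and joining the results."""
    compounds = chem.execute(
        "SELECT id FROM compounds WHERE target_gene = ?", (gene_symbol,)
    ).fetchall()
    protein = bio.execute(
        "SELECT protein FROM genes WHERE symbol = ?", (gene_symbol,)
    ).fetchone()
    if protein is None:
        return []
    return [(cid, gene_symbol, protein[0]) for (cid,) in compounds]

print(federated_query("BRCA1"))   # [('CMP-17', 'BRCA1', 'breast cancer type 1 protein')]
```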
Sun won't make equity investments in customers, says life-science group manager Sia Zadeh, but it's seeding the market with equipment and co-marketing funds. For example, Sun, Oracle, and systems integrator CGI Group said in June they'd contribute $8.5 million of a $40 million, three-year investment that Caprion Pharmaceuticals Inc., a privately held Montreal company researching causes of cancer, diabetes, and mad-cow disease, plans to use to chart protein differences between normal and diseased cells.
Despite great interest and grand visions, life-science companies may continue to wrestle for years with the challenges of managing supercomputer clusters. "It takes about 10 years for new implementations of systems software to get to the industrial-strength level," Quatris' Smaby says. But with the world eager for the medical advances, bioresearch can't afford to wait.