| Falkon usage from December 2007 to October 2008 across various systems (ANL/UC TG 316-processor cluster, SiCortex 5832-processor machine, IBM Blue Gene/P 4K- and 160K-processor machines). Over the past year, Falkon has seen wide deployment and usage across a variety of systems, including the TeraGrid, the SiCortex at Argonne National Laboratory, the IBM Blue Gene/P supercomputer at ALCF ANL, and the Sun Constellation supercomputer on the TeraGrid. Each blue dot in the figure represents a 60-second average of allocated processors, and the black line denotes the number of completed tasks. In summary, usage peaked at 163K processors, with 1.4 million CPU hours consumed and 164 million tasks completed, for an average task execution time of 31 seconds. Image courtesy of Ioan Raicu. |
(Editor's note: Ioan Raicu and Ian Foster, both of the University of Chicago and Argonne National Laboratory, contributed this article.)
Applications that run thousands of jobs can cause headaches. Huge numbers of job submissions to a site often cause bottlenecks, make system administrators grumpy, and, worse, bring down remote gateway nodes, rendering the resources useless and losing jobs in the process. Traditional techniques commonly used in the scientific community do not scale to today’s, let alone tomorrow’s, largest grids and supercomputers. But the new class of applications called Many Task Computing, discussed in the recent article “Many Task Computing: Bridging the performance-throughput gap,” has spawned development of a new framework, called Falkon, that enables applications to scale up quite painlessly and use these large systems efficiently.
Minutes to milliseconds
Falkon (Fast And Light-weight tasK executiON) is designed to help restructure applications to cut job wait time, network bandwidth use and job submission overhead, bringing per-task overheads down from minutes to milliseconds. It leaves many of the higher-overhead features, such as accounting and persistence, to the local resource managers or the applications. Falkon focuses on efficient handling of many independent tasks on large-scale distributed systems with many processors. It has demonstrated vast improvements in performance and scalability for a wide variety of tasks: tasks with execution times ranging from milliseconds to hours, compute- and data-intensive tasks, and tasks with varying arrival rates. The improvements extend across diverse applications, from astronomy to medicine, economic modeling and beyond, and to scales of billions of tasks on hundreds of thousands of processors.
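To make the general idea concrete, here is a minimal Python sketch of how many small, independent tasks can be handed to a fixed pool of long-lived workers, so that startup cost is paid once per worker rather than once per task. It is an illustration only and is not Falkon's API or implementation: the names `run_tasks`, `square` and `num_workers` are hypothetical, and the threads stand in for remote processors.

```python
import queue
import threading


def run_tasks(tasks, num_workers=4):
    """Run (func, args) tasks on a fixed pool of workers; return results in order.

    Hypothetical helper for illustration; not part of Falkon.
    """
    pending = queue.Queue()
    for task_id, (func, args) in enumerate(tasks):
        pending.put((task_id, func, args))

    results = {}
    lock = threading.Lock()

    def worker():
        # A worker starts once, then keeps pulling tasks until none remain,
        # so no per-task job submission is needed.
        while True:
            try:
                task_id, func, args = pending.get_nowait()
            except queue.Empty:
                return
            value = func(*args)
            with lock:
                results[task_id] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [results[i] for i in range(len(tasks))]


def square(x):
    # Stand-in for a short, independent task.
    return x * x


if __name__ == "__main__":
    out = run_tasks([(square, (n,)) for n in range(1000)], num_workers=8)
    print(len(out), out[:5])
```

The sketch deliberately omits accounting, persistence and error recovery, which mirrors the division of labor described above: those features are left to other layers.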
One researcher who adopted Falkon is Andrew Binkowski at the Midwest Center for Structural Genomics at Argonne National Laboratory. Binkowski and his team model three-dimensional protein structures in their basic research towards drug design. Since proteins with similar structures tend to behave in similar ways, the team compares the modeled structures to existing, known proteins in order to predict their functions, a computationally intensive task.
“As the Protein Data Bank (a repository of known proteins) expands almost exponentially, it becomes more difficult to coax desktop machines to do the types of analysis required,” says Binkowski. “We turned to Falkon as a way to utilize our existing software applications.” |