Share |

iSGTW Feature - How to run a million jobs

Feature -  How to run a million jobs


Image courtesy of David Abramson.

At SC08, several experts organized an informal session to share information on up-and-coming solutions for expressing, managing, and executing “megajobs.” They also discussed ways of repackaging work to avoid megajobs altogether.

Here iSGTW shares the latest ideas and developments about megajobs with its readers, and plans to follow up with articles on various mentioned technologies and trends in the coming months. Contributions are welcome.

Biting off a megajob—it's a lot to chew

As large systems surpass 200,000 processors, more scientists are running “megajobs”, thousands to millions of identical or very similar, but independent, jobs executed on separate processors.  From biology, physics, chemistry and mathematics to genetics, mechanical engineering, economics and computational finance, researchers want an easy way to specify and manage many jobs, arrange inputs, and aggregate outputs. They want to readily identify successful and failed jobs, repair failures, and get on with the business of research. System administrators need effective ways to process large numbers of jobs for multiple users.

Many small meals

As tools and resources change, people describe their computing jobs differently, says Ben Clifford of the University of Chicago. A decade ago, scientists submitted big jobs with big inputs and outputs. Now, due to the availability of massive processor farms, jobs tend to be numerous and small—seconds long with only kilobyte-scale data files.

Some older, well-established job management systems are extremely feature-rich, but their high overhead in scheduling and persistency makes them inefficient for executing many short jobs on many processors. Others have been developed specifically for the data-intensive, loosely-coupled, high throughput computing (HTC) grid model. These newer systems work well up to many thousands of jobs, short or long. Still newer ones, like Falkon and Gracie, which aim to scale even higher, have yet to achieve wide-scale deployment.

As job management systems change, so do applications. Ioan Raicu and Ian Foster, both of the University of Chicago and Argonne National Laboratory, have defined a class of applications called Many Tasks Computing (MTC). An MTC application is composed of many tasks, both independent and dependent, that are (in Foster’s words) “communication-intensive but not naturally expressed in Message Passing Interface,” referring to a standard for setting up communications between parallel jobs.  In contrast to high throughput computing, MTC uses many computing resources over short periods of time to accomplish many computational tasks. Megajobs naturally fit in both the HTC and MTC class of applications.

Some computer systems are undergoing extensions to support megajobs, HTC and MTC; for example, IBM provides a new high throughput, grid-style mode on the Blue Gene/P supercomputer. Raicu and his colleagues have implemented MTC support on this new platform and others via Falkon, a fast, scalable and lightweight task execution framework from the University of Chicago.
 
Ask about Swift

“Users don’t start with the idea of running a million jobs,” Clifford says. “They start with some high level application, and it’s just convenient to describe it as a million jobs.” If they can break an application into separately schedulable, restartable, relocatable ‘application procedures’, he says, then they just need a tool to describe how the pieces connect. Then the jobs are easy to run.

Clifford has helped develop Swift, a highly scalable scripting language/engine to manage procedures composed of many loosely-coupled components that take the place of megajobs. In addition to describing jobs, Swift sports clever mechanisms to reward well-behaved, high-performing sites while penalizing slow or failing sites. It also throttles job submission as needed, and controls file transfers to ensure adequate performance.

Also in the megajob medicine cabinet are Falkon and Gracie, a framework for executing massive independent tasks in parallel on grid resources. 

A slide from Li Hui's (Peking University) presentation on Gracie showing the advantage of a user-oriented resource allocating strategy.  Gracie employs three optimizing strategies for efficiency and resource utilization: it packs upto 1000s of tasks into a single grid request, it uses the user-oriented resource allocation (shown), and it balances the task workload among multiple machine resources.

Image courtesy of Li Hui.  

Indigestion?  Rethink, repackage …

Some experts think that many research questions can be answered using the same grid-type computing model, but using a different approach than megajobs.

David Abramson of Monash University wants the grid community to help scientists find smarter ways to specify problems by focusing on the scientific questions in the first place. He cites his Nimrod family of tools that provides sophisticated methods to search for good solutions as opposed to all solutions. Users can ask complex questions such as “Which parameter values will minimize the output of my model?” This helps shield the user from the complexity of managing lots of independent jobs.

John McGee from the Renaissance Computing Institute notes that a number of workload management systems on Open Science Grid including PanDA, the OSG MatchMaker, and Condor DAGMan, are used extensively to repackage the total computational workload into chunks that are optimized for the infrastructure and scheduling overhead.

According to Gregor von Laszewski of the Rochester Institute of Technology, many supercomputing applications exist that manage millions of “tasks” internally. The differentiation between “tasks” and “jobs” is essential, he says, as many users do not care about the concept of a “job”. They just want to get their work done. He believes that in the future scientists will benefit from tools that allow them to bypass the technical difficulties of mapping their work to jobs—tools that transparently perform the mapping between millions of tasks and traditional queuing systems.

In the end, a balanced diet

“The answer will not be a single tool,” von Laszewski says, “but a suite of tools addressing a variety of solutions needed to simplify access to millions of science services working in coordination.” As an example, he cites the CoG Kit and its workflow engine used internally in Swift to provide much of the functionality for scalable services, noting that it also can be used directly by scientists to simplify their job management.

To simplify access and avoid congestion, the session participants recognize that more work is needed to leverage tools and services that integrate megajobs with existing queuing systems and high performance interconnects, and to allow loosely-coupled applications to communicate directly between processors.  In the meantime, researchers have some remedies to choose from.

See links to more information and presentations from the session.

Anne Heavey and Amelia Williamson, iSGTW, and session participants David Abramson, Ioan Raicu and Gregor von Laszewski

Editor's note 5 December: small corrections and clarifications made to original article.

Your rating: None Average: 1.7 (3 votes)

Comments

nice post

We live in a time when it can

We live in a time when it can be difficult to find proper employment. Not only has the economy taken its toll in many of our lives, there is also an increase in the competition which can make it even more difficult for us to live our day-to-day lives. http://www.best-cv-templates.com/

Our research applies the new

Our research applies the new field of computational evolutionary finance to study the effect of short sale bans and transaction taxes on financial market stability. Kosmetik Online

Users don’t start with the

Users don’t start with the idea of running a million jobs,” Clifford says. “They start with some high level application, and it’s just convenient to describe it as a million jobs.” If they can break an application into separately schedulable, restartable, relocatable ‘application procedures’, he says, then they just need a tool to describe how the pieces connect. Then the jobs are easy to run. http://www.goguides.org/topic/102565.html

They are also light and as

They are also light and as powerful as the electric chainsaws. What if you never put oil in you lawn mower? It would blow up and it would be you own fault. Husqvarna chainsaws Loggers who need to clear large areas of land prefer using the petrol version as it offers faster and more powerful cutting. Another group that can be created for chainsaw parts are those that help the operator in using the chainsaw.

Leer deze grondbeginselen

Leer deze grondbeginselen over horloges en je zult snel ontdekken dat u minder tijd te besteden en blader door meer horloges, dus je kan degene die het beste past bij uw smaak en uw stijl van life.Depending op uw leeftijd opzoeken, kunt u wel of niet herinneren het zien van uw vader de wind zijn horloge elke avond voor het naar bed gaan. Hoe kun je bepalen of je kan het nodig zijn een horloge winder? Ten eerste worden horloge winders nuttig geacht voor het gemak, maar niet helemaal een noodzakelijk item. Horloges Mannen Sommige zijn op batterijen, terwijl anderen opereren vanuit de kracht van de zon! Ook zou de aard van de horloges u eigenaar bent van invloed op de aard van de watch winder u koopt.

Post new comment

By submitting this form, you accept the Mollom privacy policy.