Ewa Deelman, principal investigator of the Pegasus Workflow Management System, and Michael McLennan, director of the HUBzero platform, explain how they are working together to aid scientific research.
Over the past several years, the US National Science Foundation has been funding the development of collaborative web sites or ‘collaboratories’ and scientific workflow technologies. Many communities have adopted the HUBzero platform to create collaboratories called ‘hubs’ where they can share ideas, models, experiences, publications, and data in pursuit of research and education. Scientists have also been using the Pegasus Workflow Management System (WMS) to manage complex analyses running on their campus and on large-scale cyberinfrastructure, such as the Open Science Grid, DiaGrid, and XSEDE.
Over the last few months, a joint team of HUBzero developers (Steven Clark and Derrick Kearney of Purdue University, Indiana, US) and Pegasus developers (Mats Rynge and Karan Vahi, University of Southern California, US), together with domain experts Frank McKenna of the University of California, Berkeley, US, and Linas Mockus, also of Purdue University, have integrated the two technologies. Now, the latest HUBzero software, released at the recent HUBbub 2012 conference in September, includes Pegasus WMS as an integrated system. HUBzero supports more than 40 hubs around the world in a wide variety of scientific areas, including nanotechnology, microelectromechanical systems, manufacturing, healthcare, cancer research, biofuels, environmental modeling, volcanology, professional ethics, and STEM education. Altogether, these hubs have served more than 800,000 visitors during the past 12 months, and the integration of Pegasus brings significant computational power to this worldwide community.
Pegasus can manage workflows composed of millions of tasks, while recording all the workflow management and execution information so that the provenance of the resulting data is clear. Pegasus workflows are represented as directed acyclic graphs (DAGs) that describe the computational components, their input and output data, and parameters. These DAGs are written in XML and can be built using Java, Perl, or Python application programming interfaces (APIs). Pegasus then uses information about the available resources, data sources, and code repositories to map the high-level workflow description onto the available clusters, grids, or clouds. Through a variety of protocols, it can access Condor, GRAM, EC2, and other computational resources while retrieving data via GridFTP, Condor I/O, HTTP, scp, S3, iRODS, SRM, and FDT. Not all execution environments are the same, however: some have shared file systems across the head and worker nodes, whereas others have a fully distributed architecture. As a result, Pegasus needs to augment the workflow, adding tasks to manage the data appropriately for each environment. Pegasus can also perform data reuse by accessing data that was previously computed, and it can checkpoint the workflow, saving intermediate data products as it manages the workflow. If a failure occurs, Pegasus can retrieve the intermediate data and restart the workflow from that point. Other workflows can also make use of these intermediate products in their own computations, thus potentially saving computational time. For data-intensive applications, Pegasus can also minimize the workflow data footprint by deleting data that is no longer needed from the execution sites. Equally, when tasks in a workflow are very short running (on the order of seconds), it may be advantageous to cluster them together into larger entities, which Pegasus can also do.
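To make these ideas concrete, here is a minimal sketch in plain Python of a workflow DAG derived from each task's input and output files, a simple form of data reuse, and the clustering of short tasks. It does not use the Pegasus APIs or its XML format. The task names, file names, and helper functions are all hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy workflow: each task lists the files it reads and writes.
TASKS = {
    "split": {"inputs": ["raw.dat"], "outputs": ["part1.dat", "part2.dat"]},
    "sim1": {"inputs": ["part1.dat"], "outputs": ["out1.dat"]},
    "sim2": {"inputs": ["part2.dat"], "outputs": ["out2.dat"]},
    "merge": {"inputs": ["out1.dat", "out2.dat"], "outputs": ["result.dat"]},
}


def build_dependencies(tasks):
    """Derive the DAG edges: a task depends on whichever task produces its inputs."""
    producer = {f: name for name, t in tasks.items() for f in t["outputs"]}
    return {
        name: {producer[f] for f in t["inputs"] if f in producer}
        for name, t in tasks.items()
    }


def tasks_to_run(tasks, targets, existing_files):
    """Data reuse: walk back from the target files and skip any task
    whose output is already available from an earlier run."""
    producer = {f: name for name, t in tasks.items() for f in t["outputs"]}
    needed = set()
    pending = list(targets)
    while pending:
        f = pending.pop()
        if f in existing_files or f not in producer:
            continue  # already computed, or a raw input with no producer
        task = producer[f]
        if task not in needed:
            needed.add(task)
            pending.extend(tasks[task]["inputs"])
    return needed


def execution_order(tasks, needed):
    """Topologically sort only the tasks that still have to run."""
    deps = build_dependencies(tasks)
    pruned = {name: deps[name] & needed for name in needed}
    return list(TopologicalSorter(pruned).static_order())


def cluster(task_names, size):
    """Group independent short-running tasks into batches of at most `size`.
    Only valid for tasks that do not depend on one another."""
    names = sorted(task_names)
    return [names[i:i + size] for i in range(0, len(names), size)]


if __name__ == "__main__":
    # Suppose out1.dat survives from a previous run, so sim1 can be skipped.
    needed = tasks_to_run(TASKS, ["result.dat"], {"raw.dat", "out1.dat"})
    print(execution_order(TASKS, needed))
    print(cluster(["sim1", "sim2", "sim3"], 2))
```

In this sketch, a leftover `out1.dat` from a previous run means `sim1` is dropped from the plan, while `split`, `sim2`, and `merge` still run in dependency order. A production system additionally has to add data staging and cleanup tasks and handle failures, which this example leaves out.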
Pegasus can be used within each hub in a variety of ways. It seamlessly supports the execution of HUBzero’s ‘submit’ command, which sends jobs off to available grid/cloud resources. Pegasus manages the data transfers and recovers from failures should they arise. It can also be used directly by end users or simulation tool developers who want to create their own workflows and launch them within the hub. The resulting applications can be wrapped up in a convenient graphical user interface (GUI) via HUBzero’s Rappture toolkit and published as a ‘tool’ on the hub. As a result, any user can access a customized GUI to set up their own complex analysis and launch it with a single click onto the large-scale cyberinfrastructure.
The OpenSeesLab shows a powerful example of Pegasus/HUBzero in action. McKenna used HUBzero’s Rappture toolkit to provide a suite of tools for structural and geotechnical engineers within NEEShub. These tools prompt the user for information and launch the OpenSees program as a computational engine. One of the tools, the Moment Frame Earthquake Reliability Analysis, requires a large number of OpenSees simulations and significant post-processing of the results. McKenna uses a Pegasus workflow within this tool to execute all of the simulations and the post-processing, which run on the Open Science Grid. End users, however, have no idea that all of this is happening. They simply fill in a few parameters, push a button, and view plots of the results.