Feature - Open Science Grid crunches through CMS simulations
In the lead-up to the launch of the Large Hadron Collider and the four massive physics projects that depend on it, scientists around the world have been giving their data-processing muscle a serious workout, with superb results.
As part of the Compact Muon Solenoid experiment, Open Science Grid scientists are crunching through Monte Carlo simulations of what might happen when protons collide inside the CMS detector. Millions of proton collisions must be simulated to gather a meaningful result.
The use of OSG resources to simulate these events continues to be a spectacular success, with OSG-based event production accounting for over half of all fully simulated CMS events produced in 2007.
Simplicity and specificity
One key to success has been the combined use of the CMS ProdAgent workload management tool and JobRouter, coupled with the familiar Condor-G mechanism for dispatching jobs to OSG sites. The CMS Large Hadron Collider Grid (LCG) Production Teams used ProdAgent to drive jobs into the EGEE middleware and resource brokers and also produced a large sample of simulated events at a number of LCG sites.
“ProdAgent allows CMS to run production across a diverse set of globally distributed systems, from grids like OSG and LCG to single farms such as the Tier-0 facility at CERN,” says Evans. “The software consists of components—developed by collaborators from all across the CMS experiment—that communicate with each other using a message passing system. This design allows us to break down a very complex workflow into a set of atomic components that work together to form an efficient, coherent system.”
Black hole defense
The JobRouter Condor component, developed by Dan Bradley at Wisconsin University, is used in the OSG ProdAgent instance to dynamically route jobs to various computer farms at OSG sites.
“The JobRouter performed very well in balancing the workload under changing conditions,” says Bradley. “The software also managed to mitigate site-level ‘black holes’: sites that suddenly start rapidly failing most or all jobs.”
For individual worker-node black holes, a simple approach was used to limit the rate of failures on the node in order to prevent rapid consumption of all jobs waiting in the queue at the site in question.
The approach relied heavily on the dCache storage systems at the involved OSG sites, which were used to store the input/output data files required or produced by the jobs running locally. This significantly reduced the load on each site’s grid-gatekeeper, by limiting data flow to control and debugging information only.
From June 2007 to 30 September 2007, OSG produced about 126 million of 207 million fully simulated events and integrated consumption of about 5.5 million CPU hours. This feat was achieved—thanks to excellent software support from CMS, prompt attention by the OSG site administrators and useful monitoring tools developed by Brian Bockelman at Nebraska—using a single production server. The server comprised four CPUs with 4GB of memory and was able to handle up to 30,000 jobs in the Condor queue and to manage up to 4000 jobs running in parallel (limited due to maximum number of CPUs available) across ten OSG sites: eight in the U.S. and two in Brazil.
This work was supported in part by the CMS project and a Data Intensive Science University Network grant from the National Science Foundation.
- Ajit Mohapatra, University of Wisconsin and OSG