Feature - Opportunistic storage increases grid job success rate
The DZero high-energy physics experiment at Fermilab, an Open Science Grid user, typically submits 60,000-100,000 simulation jobs per week at 23 sites. The experiment’s application executables make many requests for input data in quick succession. Because the processing sites lacked local storage, until recently much of DZero’s simulated data had to be transferred in real time over the wide area network, leading to high latencies, job timeouts and job failures.

OSG worked with member institutions to allow DZero to use opportunistic storage, that is, idle storage on shared machines, at several sites. This represents the first successful deployment of opportunistic storage on OSG, and opens the door for other OSG Virtual Organizations. With allocations of up to 1 TB at sites where it processes jobs, DZero has increased its job success rate from roughly 30% to upwards of 85%.

Hosting storage resources is often tricky, especially for smaller grid sites, in terms of both hardware and professional expertise, says Abhishek Singh-Rana, coordinator of the Virtual Organizations group in OSG, which helps science communities achieve good results using the OSG. For this reason, the VO group negotiated with the larger OSG science communities, US ATLAS and US CMS, to allow other OSG communities to use their storage resources opportunistically. So far, DZero has used storage at six US-LHC Tier-2 sites, and is looking for more.

Opportunities

Work to improve DZero’s job efficiency began in early July, and by early August the experiment was producing about 3.7M events per week. By the second week of September, production reached a record 11.0M events, a 130% increase over its average weekly OSG production rate for the past year.

DZero’s success demonstrates the OSG’s commitment to establishing relationships with its user communities in order to benefit all members. “We are committed to the goals of the OSG, and that includes the development of opportunistic resources,” says Ken Bloom, manager of the CMS Tier-2 centers in the US. “When the OSG works well, all VOs can benefit. If we can help get opportunistic storage working for DZero, then maybe DZero sites will make some of their storage opportunistically available to CMS, and if we can make good use of that, the reward will be well worth the effort.”

Marcia Teckenbrock, Open Science Grid

The OSG is continuing to work with its stakeholders and resource providers to improve the mechanism for using opportunistic storage. CDF and SBGrid have also expressed an interest in using opportunistic storage in the future. See the recommendations based on the DZero use scenario for how OSG sites can enable opportunistic storage.

Added 16 October 2008: DZero's push for storage local to processing nodes on the grid was pioneered by Joel Snow, of Langston University and Fermilab. Snow studied the low efficiency (high failure rate) problem, determined that local storage elements would be the key to improving the efficiency, and worked with OSG to implement a solution.