
Feature - The Large Hadron Collider

Feature - LHC open to all

An actual recorded event from the Compact Muon Solenoid experiment—this event shows radiation and charged particles spilling into the detector from the beam colliding with material in the beam pipe.

Image courtesy Carl Lundstedt

Occasionally, iSGTW runs across stories in other publications related to the fields we cover. Below is an excerpt from Linux Journal, containing one person’s view of the whole process.

One of the items at the heart of the Large Hadron Collider (LHC) experiments is open-source software. The following will provide a glimpse into how scientific computing embraces open-source software and its open-source philosophy.

The LHC at CERN near Geneva, Switzerland, is nearly 100 meters underground and produces the highest-energy subatomic particle beams on Earth. The Compact Muon Solenoid experiment is one of the many collider experiments within the LHC. One of its goals is to give physicists a window into the universe fractions of a second after the big bang.

The primary computing resource for CMS is located at CERN and is called Tier-0. Its function is to record data as it comes off the detector, archive it, and transfer it to Tier-1 facilities around the globe. Each Tier-1 facility is tasked with storing this data, performing particle event reconstruction and analysis, and transferring data to secondary centers, the Tier-2s.

The jobs

How does a physicist in Europe run a scientific job using data stored in Nebraska? With grid computing, of course. Sites in Europe use the Worldwide LHC Computing Grid (WLCG) software, while US sites use the Open Science Grid (OSG) to deploy jobs remotely.

OSG’s mission is to help enable sharing of computing resources. A Virtual Organization (VO) can participate in OSG by providing computing resources while simultaneously using computing resources provided by other VOs. An analogy is SETI@home: what it does with people’s desktops, OSG does for research using university computing centers.

A week-by-week accounting of Open Science Grid usage by user VO for the past year. During that time, OSG provided 280 million hours of computing time to participating VOs; forty million of those hours went to VOs not associated with particle physics. Image courtesy Open Science Grid

OSG provides centralized packaging and support for open-source grid middleware and gives administrators an easy way to install certificate authority credentials. Furthermore, OSG monitors sites and manages a ticketing system to alert administrators of problems.

Screens from left to right: Condor View of Jobs; PhEDEx Transfer quality; Hadoop Status Page; MyOSG Site Status; CMS Dashboard Job Status; Nagios Monitoring of Nebraska Cluster; CMS Event Display of November 7 Beam Scrape Event; OSG Resource Verification Monitoring of US CMS Tier-2 Sites; HadoopViz Visualization of Packet Movement. Image courtesy Carl Lundstedt

The data

Since data transfer and management are such crucial elements for CMS, development of the underlying system has been ongoing for years. The transfer volume of Monte Carlo samples and real physics data has already surpassed 54 petabytes (or 54,000 terabytes) worldwide. Nebraska alone has downloaded 900 terabytes during the past calendar year. All this data circulation has been done via commodity servers running open-source software.
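To put the Nebraska figure in perspective, the 900 terabytes can be turned into an average sustained transfer rate. The sketch below is only back-of-the-envelope arithmetic: it assumes decimal units (1 TB = 10^12 bytes) and a 365-day year, and it says nothing about peak rates, which the article does not give.

```python
# Back-of-the-envelope average transfer rate.
# Assumptions: decimal units (1 TB = 10**12 bytes) and a 365-day year.
TB = 10**12
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def average_rate_mb_per_s(terabytes, seconds=SECONDS_PER_YEAR):
    """Return the sustained rate in MB/s needed to move `terabytes` in `seconds`."""
    return terabytes * TB / seconds / 10**6


nebraska_rate = average_rate_mb_per_s(900)
print(f"Nebraska average: {nebraska_rate:.1f} MB/s "
      f"({nebraska_rate * 8 / 1000:.2f} Gbit/s)")
```

Under these assumptions the average works out to roughly 28.5 MB/s, or a little under a quarter of a gigabit per second, sustained around the clock for a year.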

Data movement is managed using a custom framework called PhEDEx (Physics Experiment Data Export). PhEDEx does not actually move data but serves as a mechanism to initiate transfers between sites. PhEDEx agents running at each site interface with database servers located at CERN to determine which data is needed at that site. X.509 proxy certificates are used to authenticate transfers between GridFTP doors at the source and destination sites.
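The division of labor described above, where an agent decides what a site needs and leaves the actual copying to another layer, can be illustrated with a toy model. This is not PhEDEx code or its API: the class names, the `subscriptions` and `replicas` dictionaries, and the site and dataset names are all hypothetical, chosen only to show the planning step.

```python
from dataclasses import dataclass, field


@dataclass
class TransferRequest:
    dataset: str
    source: str
    destination: str


@dataclass
class SiteAgent:
    """Toy stand-in for a per-site agent; hypothetical, not the PhEDEx API."""
    site: str
    local_datasets: set = field(default_factory=set)

    def plan_transfers(self, subscriptions, replicas):
        """Return requests for subscribed datasets that are missing locally.

        subscriptions: dict mapping site name -> set of wanted dataset names
        replicas: dict mapping dataset name -> list of sites holding a copy
        """
        wanted = subscriptions.get(self.site, set())
        requests = []
        for dataset in sorted(wanted - self.local_datasets):
            sources = [s for s in replicas.get(dataset, []) if s != self.site]
            if sources:  # skip datasets with no known remote copy
                requests.append(TransferRequest(dataset, sources[0], self.site))
        return requests


# Hypothetical central bookkeeping, standing in for the database at CERN.
subscriptions = {"T2_Example": {"/mc/sample-a", "/data/run-b"}}
replicas = {
    "/mc/sample-a": ["T1_Example"],
    "/data/run-b": ["T1_Example", "T2_Example"],
}

agent = SiteAgent("T2_Example", local_datasets={"/data/run-b"})
planned = agent.plan_transfers(subscriptions, replicas)
for req in planned:
    print(f"{req.dataset}: {req.source} -> {req.destination}")
```

The agent only produces a list of requests. In the real system the authenticated copy between storage doors is a separate step, which the sketch deliberately leaves out.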

However, the volume of material is huge: CMS generates more than one terabyte of data per day and each Tier-2 site stores hundreds of terabytes for analysis. This raises the question: How do you effectively store hundreds of terabytes and allow for analysis from grid-submitted jobs?

The answer, when the CMS Tier-2s were created at Nebraska, was dCache, a distributed filesystem package written at DESY, Germany’s high-energy physics laboratory. It acts as a front end for large tape storage (for some applications, tape is still considered superior).

The CMS computing model, however, did not allow funding for large tape storage at Tier-2 sites. It was much cheaper to purchase hard drives and deploy them in worker nodes than to buy large disk vaults. This means the real strength of dCache was not being exploited at Tier-2 sites.

This prompted Nebraska to look to open source for a solution: HDFS, a distributed filesystem provided by Hadoop, a software framework for distributed computing. HDFS makes it easy to use the available hard drive slots in worker nodes, and to do so at low cost.
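HDFS gets its capacity from ordinary disks by splitting each file into blocks and keeping more than one copy of every block on different nodes. The sketch below is a toy model of that idea only. The round-robin placement, the 64 MB block size, the replication factor of 2, and the worker names are illustrative assumptions; real HDFS uses its own placement policy and configurable settings.

```python
import math
from itertools import cycle


def place_blocks(file_size_mb, nodes, block_size_mb=64, replication=2):
    """Toy placement: split a file into blocks and put each replica on a distinct node.

    Round-robin is used for simplicity; it is not HDFS's real placement policy.
    """
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    ring = cycle(nodes)
    placement = {}
    for block in range(n_blocks):
        # Consecutive draws from the ring are distinct while replication <= len(nodes).
        placement[block] = [next(ring) for _ in range(replication)]
    return placement


# A 2 GB file (2,048 MB) spread over three hypothetical worker nodes.
workers = ["worker01", "worker02", "worker03"]
placement = place_blocks(2048, workers)
print(len(placement), "blocks; block 0 lives on", placement[0])
```

Because every block lives on two different machines in this model, losing a single worker node's disk does not lose any data, which is what makes cheap drives in worker nodes a workable substitute for dedicated disk vaults.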

A student at the University of Nebraska-Lincoln, Derek Weitzel, completed a student project that shows the real-time transfer of data in the HDFS system. Called HadoopViz, this visualization shows all packet transfers in the HDFS system as ‘raindrops’ arcing from one server to another.

The analysis

Once the data is stored at a Tier-2 site, physicists analyze it using the Linux platform. The Tier-2 at Nebraska runs CentOS as its primary platform.

With data files at about 2 GB in size and data sets hovering in the low terabyte range, full data set analysis on a typical desktop is impractical. Once the coding and debugging phase is completed, the analysis is run over the entire data set at a Tier-2 site. Submitting an analysis to a grid computing site has been automated with software developed by CMS called CRAB (CMS Remote Analysis Builder).
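The file and data set sizes above imply a lot of files, which is why an analysis is broken into many grid jobs rather than run in one piece. The sketch below shows that splitting step in isolation. It is not CRAB's interface: the function, the file names, the 2 TB data set size, and the choice of 25 files per job are hypothetical, picked to match the rough sizes quoted in the text.

```python
def split_into_jobs(files, files_per_job):
    """Group a list of input files into consecutive chunks, one chunk per job."""
    if files_per_job < 1:
        raise ValueError("files_per_job must be at least 1")
    return [files[i:i + files_per_job] for i in range(0, len(files), files_per_job)]


# Illustrative numbers: a 2 TB data set made of 2 GB files (decimal units).
n_files = 2_000 // 2  # 2,000 GB / 2 GB per file = 1,000 files
files = [f"file_{i:04d}.root" for i in range(n_files)]  # hypothetical names

jobs = split_into_jobs(files, files_per_job=25)
print(len(jobs), "jobs;", len(jobs[0]), "files in the first,", len(jobs[-1]), "in the last")
```

With these numbers the data set becomes 40 independent jobs of 25 files each, and each job can run on whichever worker node is free.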

How a physicist sees CMS—this is the event display of a single simulated event.  

Image courtesy Carl Lundstedt

The Grateful conclusion

The LHC will enable physicists to investigate the inner workings of the universe. The accelerator and experiments have been decades in design and construction and are now setting new benchmarks for energetic particle beams. Even if you don't know your quark from your meson, your contributions to open-source software are helping physicists at the LHC and around the world.

A version of this article first appeared in the November issue (#199) of Linux Journal.

—Carl Lundstedt, University of Nebraska Grid Computing Center, with edits by Miriam Boon
