Feature - LHC open to all
Occasionally, iSGTW runs across stories in other publications related to the fields we cover. Below is an excerpt from Linux Journal, containing one person’s view of the whole process.
One of the items at the heart of the Large Hadron Collider (LHC) experiments is open-source software. The following will provide a glimpse into how scientific computing embraces open-source software and its open-source philosophy.
The LHC at CERN near Geneva, Switzerland, is nearly 100 meters underground and produces the highest-energy subatomic particle beams on Earth. The Compact Muon Solenoid (CMS) experiment is one of the many collider experiments at the LHC. One of its goals is to give physicists a window into the universe fractions of a second after the big bang.
The primary computing resource for CMS is located at CERN and is called Tier-0. Its function is to record data as it comes off the detector, archive it, and transfer it to Tier-1 facilities around the globe. Each Tier-1 facility is tasked with storing this data, performing particle event reconstruction and analysis, and transferring data to secondary centers: Tier-2s.
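The tiered flow described above can be pictured as a small tree of sites. The following is only a minimal sketch: the site names, the dataset label, and the `Site` and `distribute` helpers are hypothetical and are not part of any CMS software.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Site:
    """A computing site in the tiered model (names here are made up)."""
    name: str
    tier: int
    downstream: List["Site"] = field(default_factory=list)


def distribute(site, dataset, log=None):
    """Record each hop a dataset takes from `site` to the sites below it."""
    if log is None:
        log = []
    for child in site.downstream:
        log.append((site.name, child.name, dataset))
        distribute(child, dataset, log)
    return log


# Hypothetical three-level chain: Tier-0 -> Tier-1 -> Tier-2.
tier2 = Site("T2_Example_Nebraska", 2)
tier1 = Site("T1_Example", 1, [tier2])
tier0 = Site("T0_CERN", 0, [tier1])

hops = distribute(tier0, "run-0001")
for source, destination, dataset in hops:
    print(f"{dataset}: {source} -> {destination}")
```

In practice each Tier-1 feeds several Tier-2s, which would simply mean more entries in each `downstream` list.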
How does a physicist in Europe run a scientific job using data stored in Nebraska? With grid computing, of course. Sites in Europe use the Worldwide LHC Computing Grid (WLCG) software, while US sites use the Open Science Grid (OSG) to deploy jobs remotely.
OSG’s mission is to help enable sharing of computing resources. A Virtual Organization (VO) can participate in OSG by providing computing resources and simultaneously use computing resources provided by other VOs. An analogy of this relationship is SETI@home: what it does with people’s desktops, OSG does for research using university computing centers.
OSG provides centralized packaging and support for open-source grid middleware and gives administrators an easy way to install certificate authority credentials. Furthermore, OSG monitors sites and manages a ticketing system to alert administrators of problems.
Since data transfer and management are such crucial elements for CMS, development of the underlying system has been ongoing for years. The transfer volume of Monte Carlo samples and real physics data has already surpassed 54 petabytes (or 54,000 terabytes) worldwide. Nebraska alone has downloaded 900 terabytes during the past calendar year. All this data circulation has been done via commodity servers running open-source software.
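To put those volumes in perspective, the figures can be converted into an average sustained rate. This is a back-of-the-envelope sketch that assumes decimal units (1 TB = 10^12 bytes) and a 365-day year with transfers spread evenly, which real traffic certainly is not.

```python
TB = 10**12  # bytes in a terabyte (decimal units, an assumption)
PB = 10**15  # bytes in a petabyte
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def petabytes_to_terabytes(pb):
    """Convert petabytes to terabytes."""
    return pb * PB / TB


def average_rate_mbit_per_s(total_bytes, seconds):
    """Average transfer rate in megabits per second over a time span."""
    return total_bytes * 8 / seconds / 10**6


worldwide_tb = petabytes_to_terabytes(54)
nebraska_tb_per_day = 900 / 365
nebraska_mbit_s = average_rate_mbit_per_s(900 * TB, SECONDS_PER_YEAR)

print(f"Worldwide volume: {worldwide_tb:,.0f} TB")
print(f"Nebraska average: {nebraska_tb_per_day:.1f} TB/day")
print(f"Nebraska average: {nebraska_mbit_s:.0f} Mbit/s sustained")
```

Under these assumptions, Nebraska's 900 terabytes works out to roughly 2.5 terabytes per day, or a little under 230 megabits per second sustained around the clock.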
Data movement is managed using a custom framework called PhEDEx (Physics Experiment Data Export). PhEDEx does not actually move data but serves as a mechanism to initiate transfers between sites. PhEDEx agents running at each site interface with database servers located at CERN to determine which data is needed at that site. X509 proxy certificates are used to authenticate transfers between gridftp doors at the source and destination sites.
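The division of labor described above, in which an agent decides what is needed and a separate layer moves the bytes, can be illustrated with a toy example. This is not PhEDEx code or its interface: the function name, the dictionaries standing in for the central database, and the site and file names are all hypothetical.

```python
def pending_transfers(subscriptions, local_files, replicas, site):
    """Return (source, destination, file) requests for files that `site`
    is subscribed to but does not yet hold.

    subscriptions: site name -> set of file names wanted at that site
    local_files:   set of file names already stored at `site`
    replicas:      file name -> set of sites that hold a copy
    """
    requests = []
    wanted = subscriptions.get(site, set()) - local_files
    for name in sorted(wanted):
        sources = sorted(s for s in replicas.get(name, set()) if s != site)
        if sources:  # skip files with no known source yet
            requests.append((sources[0], site, name))
    return requests


# Hypothetical catalog contents standing in for the central database.
subscriptions = {"T2_Example": {"a.root", "b.root", "c.root"}}
local_files = {"a.root"}
replicas = {"b.root": {"T1_Example"}, "c.root": set()}

requests = pending_transfers(subscriptions, local_files, replicas, "T2_Example")
print(requests)
```

The sketch only produces a list of requests; the actual copy, with its authentication, would be handed to the transfer layer.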
However, the volume of material is huge: CMS generates more than one terabyte of data per day and each Tier-2 site stores hundreds of terabytes for analysis. This raises the question: How do you effectively store hundreds of terabytes and allow for analysis from grid-submitted jobs?
The answer, when CMS Tier-2s were created at Nebraska, was dCache, a distributed filesystem package written at Germany’s DESY high-energy physics laboratory. It acts as a front end for large tape storage (for some applications, tape is still considered superior).
The CMS computing model, however, did not allow funding for large tape storage at Tier-2 sites. It was much cheaper to purchase hard drives and deploy them in worker nodes than to buy large disk vaults. This means the real strength of dCache was not being exploited at Tier-2 sites.
This prompted Nebraska to look to open source for a solution: HDFS, a distributed filesystem provided by Hadoop, a software framework for distributed computing. HDFS makes it easy to use the available hard drive slots in worker nodes, and at low cost.
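HDFS stores each file as a sequence of blocks and keeps several copies of each block on different nodes, which is what lets ordinary worker-node drives act as one large store. The sketch below illustrates the idea with a simple round-robin layout. It is not HDFS's actual placement policy, and the block size, replication count, and node names are assumptions chosen for illustration.

```python
import math


def place_blocks(file_size_mb, block_size_mb, nodes, replication):
    """Split a file into blocks and assign each block's replicas to
    distinct nodes in round-robin order (a simplified stand-in for a
    real placement policy)."""
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(n_blocks):
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# Assumed values: a 2048 MB file, 128 MB blocks, four workers, two copies.
nodes = ["node0", "node1", "node2", "node3"]
placement = place_blocks(2048, 128, nodes, replication=2)

for block, holders in placement.items():
    print(f"block {block:2d} -> {holders}")
```

Because every block lives on more than one node, losing a single drive or worker does not lose data in this layout.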
A student at the University of Nebraska-Lincoln, Derek Weitzel, completed a student project that shows the real-time transfer of data in the HDFS system. Called HadoopViz, this visualization shows all packet transfers in the HDFS system as ‘raindrops’ arcing from one server to another.
Once the data is stored at a Tier-2 site, physicists analyze it using the Linux platform. The Tier-2 at Nebraska runs CentOS as its primary platform.
With data files at about 2 GB in size and data sets hovering in the low terabyte range, full data set analysis on a typical desktop is impractical. Once the coding and debugging phase is completed, the analysis is run over the entire data set at a Tier-2 site. Submitting an analysis to a grid computing site has been automated with software developed by CMS called CRAB (CMS Remote Analysis Builder).
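A rough sense of why this work is farmed out to a grid site comes from counting files and splitting them into jobs. The sketch below is not CRAB or its interface. The 3 TB data set size, the 50 files per job, and the file names are assumptions, with only the 2 GB file size taken from the text.

```python
def split_into_jobs(files, files_per_job):
    """Group a list of input files into fixed-size chunks, one per job."""
    return [
        files[i:i + files_per_job]
        for i in range(0, len(files), files_per_job)
    ]


FILE_SIZE_GB = 2        # from the text: files of about 2 GB
DATASET_SIZE_GB = 3000  # assumed: a 3 TB data set (decimal units)
FILES_PER_JOB = 50      # assumed job size

n_files = DATASET_SIZE_GB // FILE_SIZE_GB
files = [f"file_{i:04d}.root" for i in range(n_files)]  # made-up names
jobs = split_into_jobs(files, FILES_PER_JOB)

print(f"{n_files} files split into {len(jobs)} jobs")
```

Each chunk can then run as an independent job on a different worker node, which is what makes a pass over the full data set practical.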
The Grateful conclusion
The LHC will enable physicists to investigate the inner workings of the universe. The accelerator and experiments have been decades in design and construction and are now setting new benchmarks for energetic particle beams. Even if you don't know your quark from your meson, your contributions to open-source software are helping physicists at the LHC and around the world.
—Carl Lundstedt, University of Nebraska Grid Computing Center, with edits by Miriam Boon