Is the grid secure?
It’s a question of critical importance, and of course it doesn't have a simple answer. Just like all IT systems, the grid is under constant attack and requires vigilant security teams.
“It’s a question of being prepared,” said Sven Gabriel, a security officer at NIKHEF, the national institute for subatomic physics in the Netherlands. His team of security experts simulated an attack on the grid using a virus-like security test payload in May 2011 as part of an annual security challenge. “We mimicked a global attack on the grid infrastructure, and the ‘infection’ was spread around 40 different sites in 20 countries,” Oscar Koeroo, a grid middleware security developer at NIKHEF, said.
Preparing for any kind of attack on the Worldwide Large Hadron Collider Computing Grid (WLCG) is a big task, as it services more than 8,000 users and consists of more than 300 ‘sites’ – data centers, research institutes or computer farms – across more than 50 countries.
Each site is responsible for its own security – smaller sites are run by just a few people who handle operations, while larger ones have a dedicated security team.
“While some sites have deep security and forensic knowledge, other sites are lacking this expertise,” said Gabriel. “It is important to identify those individuals [with specific expertise] and get them involved helping other teams.”
Each site was responsible for figuring out what the virus was doing there and what problems it was introducing. The site then had to locate the virus and shut it down. One of the key aspects of security under examination, however, was the level of collaboration between sites. This was the fifth time such a challenge had been run, but the first time the collaboration of globally distributed sites had come under scrutiny; previous challenges looked only at the response of individual sites.
On May 26, the challenge started: the researchers infected the 40 sites with the virus and then sent the first alarm to a site in the Asia Pacific region. The Asia Pacific region was targeted first, Koeroo said, because its sites are the furthest east and could thus pass the information west as the rest of the sites started their day.
A Russian security expert, Eygene Ryabinkin, who is based at the Kurchatov Institute, the national nuclear energy research institute in Moscow, discovered the virus before he even received his alert. Ryabinkin traced the virus to the server in the Netherlands that the NIKHEF team were using, and was able to stop the attack.
Luckily, the NIKHEF team were able to tell Ryabinkin that it was just a test payload, not a real attack, and so the challenge continued after a few minutes, Koeroo said.
“Each site has a different level of expertise, one of the nice things about this challenge was that it identified who each of the security experts are within the grid team,” said Gabriel. “And Ryabinkin is one of them.”
Another example, Gabriel reported, was Daniel Kouril from the Czech Republic, who retrieved data and analysis from his national network service provider to reveal potentially affected sites.
Of course, the challenge would not have been complete without a large consumer of grid resources: the ATLAS experiment. ATLAS provided a copy of its computing infrastructure for the challenge so that the security teams could carry out a full-fledged incident response without affecting the production system.
“We used the job submission framework at ATLAS to launch our ‘attack’,” said Koeroo.
The NIKHEF team made up a story in which a user, Hegoi – who is, in fact, a real ATLAS user based at NIKHEF – had his laptop stolen at a conference. (Hegoi himself went along with it, going so far as to request a new laptop from NIKHEF, but that's another story.) According to this manufactured scenario, Hegoi's laptop contained his certificate – a password-protected electronic user ID – which allows him to submit jobs to the grid through ATLAS.
After the NIKHEF team submitted the virus using Hegoi's user ID, the ATLAS security team decided to stop their dummy infrastructure while they identified which user had introduced it. This exercise helped in understanding and addressing potential issues with the production services, said Gabriel.
“We already had an idea of the operational security situation [from previous challenges] at each site, therefore we expected that a challenge at a global scale should also be possible and reveal more interesting facts,” said Gabriel. “Considering it was a challenge on that scale, it went quite smoothly.”
“It is obvious that a close collaboration of all security teams is needed, otherwise a proper incident response will be difficult,” he said. Gabriel and his team now plan to run a similar test for many of the national grid infrastructures.