Feature - Data for the people: how to fast-track your network and share the data love
So you’ve used a grid to split up your job, process it faster, then return your results. You now have a nice chunky terabyte of data. What do you do with it?
“In terms of impact on society, the ability to use transparently other people’s data is going to be transforming,” Grossman says.
“It is about ‘network effects’,” he continues. “In the same way that a network becomes more interesting as more people join it, you can draw more interesting conclusions about your own data if you put it into the context of other people’s data.”
A fine notion in principle
But how can you get these network-busting bundles of new data to the people who need them?
Simple, says Grossman. You just send them, to everyone and anyone who might like to take a look.
“Our motivation for the last ten years has been to create a web for data, so it’s easy to browse, explore and download it. The system we built, called DataSpace, still controls who can write data, but we encourage anyone in the world to read it.”
Driven by this goal, Grossman turned his eye to the networks: could they distribute large data sets across thousands of miles without wasting a second? Not really.
Grossman describes the old faithful TCP internet protocol—still going strong after nearly 25 years—as “a huge success story,” but, he says, new versions of TCP just weren’t coming out fast enough to solve his problem in good time.
“It was clear the network would change, but we didn’t want to wait ten years for that to happen. So we built our own infrastructure instead.”
Enter the fast lane
UDT, short for UDP-based Data Transfer (UDP being the User Datagram Protocol), is the result. Able to shoot data around the world at 10 gigabits per second, UDT compares well with the three or four megabits per second that standard TCP, as it was usually deployed, was achieving. “And if you’re impatient like me…” jokes Grossman, “…I know which one I’d prefer.”
For those keen on a more global challenge, UDT was used just last month to move 1.4 terabytes of Sloan Digital Sky Survey (SDSS) data from Chicago all the way to Moscow. The transfer was complete in about 4.5 hours using a one-gigabit-per-second link.
Even more exciting, UDT is now an option for GridFTP.
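To put the quoted figures in perspective, here is a rough back-of-the-envelope sketch in Python. It is illustrative arithmetic only, not part of UDT: the helper names are made up for this example, a terabyte is assumed to mean 10^12 bytes, and protocol overhead is ignored.

```python
# Back-of-the-envelope throughput arithmetic for the figures quoted above.
# Assumptions (not from the article): decimal units (1 TB = 10**12 bytes),
# no protocol overhead, and hypothetical helper names.

TB = 10**12  # bytes in one decimal terabyte (assumed)


def effective_rate_bps(num_bytes, seconds):
    """Average rate in bits per second for a completed transfer."""
    return num_bytes * 8 / seconds


def transfer_time_s(num_bytes, rate_bps):
    """Seconds needed to move num_bytes at a steady rate in bits per second."""
    return num_bytes * 8 / rate_bps


# Chicago-to-Moscow transfer: 1.4 TB in about 4.5 hours on a 1 Gbit/s link.
sdss_rate = effective_rate_bps(1.4 * TB, 4.5 * 3600)
link_utilization = sdss_rate / 1e9

# Moving one terabyte at the two rates quoted in the article.
time_at_10_gbps = transfer_time_s(TB, 10e9)  # UDT figure
time_at_4_mbps = transfer_time_s(TB, 4e6)    # upper end of the TCP figure

print(f"SDSS transfer: {sdss_rate / 1e6:.0f} Mbit/s on average")
print(f"Share of the 1 Gbit/s link used: {link_utilization:.0%}")
print(f"1 TB at 10 Gbit/s: {time_at_10_gbps / 60:.1f} minutes")
print(f"1 TB at 4 Mbit/s: {time_at_4_mbps / 86400:.1f} days")
```

Under these assumptions, the Moscow transfer averaged roughly 690 megabits per second, or about two thirds of the link's nominal capacity. A single terabyte would take minutes at 10 gigabits per second but weeks at four megabits per second.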
This progress points in some interesting directions for Grossman and his team.
“We want to lower the cost of getting hold of other people’s terabytes,” he says. “I want to be able to find out, in just a few minutes, whether someone’s data is going to be useful for my research.”
When asked about the policy of some collaborations in restricting who can access their data, Grossman replied: “You don’t have to have a PhD to be interested in data and to want to analyze it. And if you want to analyze it, you have to be able to touch it. We’re building that infrastructure.”
- Cristy Burne, iSGTW