Feature - Rethinking scientific data management
Despite all the good that science has wrought over the years, the way we manage scientific data is fundamentally flawed.
Sir Isaac Newton once wrote, “If I have seen further it is by standing on the shoulders of giants.” Scientists stand on the shoulders of their peers and predecessors via peer-reviewed literature. The idea is that the literature is reliable by dint of being peer-reviewed, and thus researchers can safely build upon what they learn from it. Yet neither the reviewers who admit papers into the scientific canon nor the scientists who wish to build upon them have access to the data used to produce those papers.
That means that they cannot ensure that they stand on solid ground by examining the data and doing their own analysis. They cannot analyze the data using alternative methods. And they cannot use it to address additional research questions.
Indeed, although the papers are preserved for posterity, there is no guarantee that the data will be. Even if researchers are inclined to share their data with anyone who asks, in the absence of a well-designed system for data preservation the data could be lost in any number of ways before they have a chance to pass it on.
From ancient archives to digital data
It’s all information to librarians, and they’ve been in the business of preserving knowledge for centuries.
“We think that it’s important for libraries to bring data into their environments,” said Sayeed Choudhury, the principal investigator for the Data Conservancy project. “I happen to think of scientific data as the new form of special collections.”
The Data Conservancy is a virtual organization that is researching and developing a data curation infrastructure to support the preservation, use, and reuse of scientific data. It is funded via the US National Science Foundation’s DataNet program, which has funded one other project, DataONE, and plans to fund up to three others. The collaboration has ten partners and many other unfunded participants from around the world.
Choudhury, who is based out of the Johns Hopkins University Sheridan Libraries, hopes to harness the strengths of research libraries to accomplish the Data Conservancy’s goals. The potential advantages are numerous. Because libraries serve areas of academic inquiry ranging from art to theoretical physics, libraries curating data will enjoy certain economies of scale. And unlike individual research projects, which have a start and end, libraries are institutions that are expected to continue to exist indefinitely, making them ideal for long-term preservation and archiving.
In some ways, the Data Conservancy’s library-based heritage has informed their architectural choices.
“The data model that we’re working on is a collection-based data model that was developed for preservation processes,” Choudhury said. “I think it’s a good start.”
The Data Conservancy’s architecture must meet a wide variety of requirements. It must be sufficiently agile to adapt as computing technology evolves. It must be capable of handling a variety of security restrictions on data. It must be compatible with other online resources, including discipline-specific portals and applications. It must effectively track data provenance, so that no piece of metadata is lost along the way. And it must be able to handle a large number of files, regardless of file size.
“The storage systems that we’ve been dealing with, what we’ve found is that they’re not adequate for the scale and complexity of the scientific data we’re dealing with,” Choudhury explained. The system they design will have to overcome those difficulties.
Enabling flexible access over time
Like many other long-term cyberinfrastructure projects, the Data Conservancy team is focusing on developing a service layer that can connect to existing APIs, such as those of the Virtual Observatory, a popular application among astronomers. Users will be able to access the data through those externally maintained software applications. Researchers will also be able to access data sets using a web portal developed by the Data Conservancy, which will incorporate analysis and visualization tools chosen for their applicability to many fields of science (e.g. image manipulation), Choudhury said.
In the short term, this approach means that virtually any science gateway, analysis software, or other application will be able to take advantage of the Data Conservancy. In the long term it means that as technology evolves, the Data Conservancy will only need to keep that service layer up-to-date to remain useful.
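To make the idea of a stable service layer concrete, here is a minimal Python sketch. It is not the Data Conservancy's actual design: the names StorageBackend, InMemoryBackend, and ServiceLayer are hypothetical, invented purely for illustration. The point it shows is that applications talk only to the service layer, so storage technology can change behind it without the applications noticing.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class StorageBackend(ABC):
    """Hypothetical interface that any storage node would implement."""

    @abstractmethod
    def read(self, dataset_id: str) -> bytes:
        """Return the raw bytes of a data set, or raise KeyError."""


class InMemoryBackend(StorageBackend):
    """Stand-in for a real node (disk, tape, grid, cloud) in this sketch."""

    def __init__(self, blobs: Dict[str, bytes]) -> None:
        self._blobs = dict(blobs)

    def read(self, dataset_id: str) -> bytes:
        return self._blobs[dataset_id]


class ServiceLayer:
    """Single entry point for portals, gateways, and analysis software."""

    def __init__(self) -> None:
        self._backends: List[StorageBackend] = []

    def register(self, backend: StorageBackend) -> None:
        # New storage technology is added here; callers are unaffected.
        self._backends.append(backend)

    def fetch(self, dataset_id: str) -> bytes:
        # Ask each registered backend in turn until one holds the data set.
        for backend in self._backends:
            try:
                return backend.read(dataset_id)
            except KeyError:
                continue
        raise KeyError(dataset_id)
```

In this sketch, retiring an old backend and registering a new one changes nothing for callers of fetch, which is the property the service-layer approach relies on.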
Some of the information will live on equipment owned by the Data Conservancy – in fact, they’ve already acquired some servers and tapes. But their equipment is only meant to be a single node on a larger network of library-based nodes. Additional resources for analysis and data storage could also come from a grid or cloud, or through partnerships with organizations such as TeraGrid.
Some of these options, however, introduce further challenges. In addition to the usual security concerns regarding who can access data and the strength of the security enforcing those restrictions, some researchers may also require that data be stored within a specific country, something that some clouds and grids cannot guarantee.
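As a rough illustration of the country-of-storage requirement, the Python sketch below filters candidate storage nodes by location. Everything in it is assumed for the example: the StorageNode type, the eligible_nodes function, and the convention that a country of None means the provider cannot guarantee where the data physically resides.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class StorageNode:
    name: str
    # ISO country code, or None when the provider cannot guarantee
    # where the data physically lives (an assumption of this sketch).
    country: Optional[str]


def eligible_nodes(
    nodes: List[StorageNode], required_country: Optional[str] = None
) -> List[StorageNode]:
    """Return the nodes allowed to hold a data set.

    With no residency requirement every node qualifies. Otherwise only
    nodes with a guaranteed, matching country qualify.
    """
    if required_country is None:
        return list(nodes)
    return [node for node in nodes if node.country == required_country]
```

A node with an unknown location is never selected for restricted data, which mirrors the concern that some clouds and grids cannot make that guarantee.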
The open data movement
Of course, if researchers chose to release their data openly, many of these security concerns would disappear entirely. This idea, championed by the open data movement, has the support of a long list of government and scientific organizations.
Nonetheless, adoption of these practices has been slow. There are many reasons why, but the simplest reason is that until the advent of the Internet, and in particular of online repositories, it just wasn’t practical. In fact, data has been completely private for so long that exposing it to the scrutiny of others may seem practically obscene. From a personal perspective, revealing your data to the world can feel like baring your soul to a room full of strangers—terrifying, regardless of how much confidence you have in your work.
From the perspective of a scientific culture in which data has always been private and everyone competes for tenured positions, it also may seem only fair that the beneficiaries of a data set should be those who labored to generate it. To some researchers, anyone who wishes to use data without doing the legwork is, simply put, a freeloader.
Of course, there are more practical concerns about data security. The data from military research, for example, may never be open. And data from clinical medical research comes with a whole host of privacy concerns: protecting the rights of the individuals who volunteered to participate, both legally and ethically, poses a variety of technical challenges.
Today, numerous large-scale cyberinfrastructure projects have chosen to emulate particle accelerators, functioning as distributed virtual user facilities that gather and disseminate data. Their focus is on designing, creating, and maintaining the scientific instruments that gather data, and then making that data available, not on doing anything with it. Some make the data they gather freely available to the public, while others provide access only to scientists. Both approaches effectively leapfrog over the issues surrounding proprietary rights to data, since their entire purpose is to get data to any and all scientists. The result is a proving ground where we can observe what happens when data becomes freely available.
Because they consist of large, complex, varied, pre-existing, and growing collections of data with no accompanying proprietary issues, these data sets are an ideal starting point for the Data Conservancy.
A handful of online journals likewise serve as proof of concept for the “open journal” paradigm, in which papers—and sometimes the associated data—are available for anyone to read. When you read these journal articles, and you think of a question about the data that is not answered in the paper, you can answer it yourself. If you have a research question related to the paper’s topic, and the original researcher happened to gather the data you need, you can perform your analysis using that data rather than spending grant money to repeat the experiment.
“One of the prototypes we have is using a system called arXiv.org; we’re testing that out as a sandbox,” Choudhury explained. “We intend to have a pilot system up and running where people will be able to deposit the data if they wish when they submit their preprint.”
Don’t touch that – you don’t know where it’s been
Just because data is freely available—or even attached to a peer-reviewed paper—does not mean that you will want to use it. To make sense of data, we need to know its provenance. This consists of metadata that will tell us the life story of a data set, including information on how, when, and with what equipment it was recorded, what’s been done to it since then, and where it has been.
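One way to picture such a life story is as a record of how the data was captured plus an append-only log of what has happened to it since. The Python sketch below is only an illustration under that assumption; the ProvenanceRecord and ProvenanceEvent names and their fields are hypothetical, not the metadata model the Data Conservancy uses.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ProvenanceEvent:
    when: str      # ISO 8601 date of the event
    action: str    # what was done to the data
    location: str  # where the data was at the time


@dataclass
class ProvenanceRecord:
    recorded_on: str  # when the data was captured
    instrument: str   # equipment used to capture it
    method: str       # how it was captured
    history: List[ProvenanceEvent] = field(default_factory=list)

    def log(self, when: str, action: str, location: str) -> None:
        # Events are only ever appended, so earlier history is not lost.
        self.history.append(ProvenanceEvent(when, action, location))

    def life_story(self) -> List[str]:
        # One line for the original capture, then one per later event.
        lines = [
            f"{self.recorded_on}: recorded with {self.instrument} ({self.method})"
        ]
        lines += [f"{e.when}: {e.action} at {e.location}" for e in self.history]
        return lines
```

A reader deciding whether to trust a data set could then scan life_story() for the capture conditions and every later transformation or move.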
“We do use a great deal of metadata to provide the context, the provenance, and annotation of the data themselves. Some of them are machine generated in the case of, say, astronomy. Others, in cases of geology data we’ve been looking at, it’s human-designed and -entered,” Choudhury said. “What we are beginning to question is whether or not this approach is scalable and if it captures all of the information that is needed in the long-term.”
Provenance is just one of many areas that call for further research and development.
“Fundamentally we need to learn what it means to preserve scientific data,” Choudhury said. “We also need to have a good understanding of cross-disciplinary and transdisciplinary types of data.”
The ability to serve the data-related needs of tomorrow’s basic science questions is at the core of the Data Conservancy’s mandate. And since the basic science questions we face shift and evolve over time, that means they will have to continually evaluate the technical requirements embedded in those questions and the data management systems needed to meet those requirements.
That’s why the Data Conservancy team hopes to permanently embed research and development groups within research libraries.
For example, “there are a number of scientists at Cornell that are working on an ontology based on observation,” Choudhury said. “We believe that the observation is a concept that is common to all sciences.”
Research groups like the one at Cornell serve a dual purpose. Beyond the research itself, they address a staffing problem: according to Choudhury, there is a shortage of data scientists with specialized domain knowledge. This shortage will only get worse if pundits’ predictions about the rapid increase in production of scientific data prove to be true. Research groups provide a place to mentor the next generation of data scientists.
The Data Conservancy is a little more than one year into a five-year grant of US$20 million. What will happen at the end of that grant remains unknown. Nonetheless, it is clear that sustainability is an important consideration; a sustainability team that includes MBAs from Johns Hopkins University’s business school is exploring options in search of a solution that might make the Data Conservancy self-sustaining.
Until then, their goals are clear.
Said Choudhury, “I don’t think scientists really care how their data is being preserved and how they managed to discover new data sets to run new and interesting analyses; they just care that it works.”
—Miriam Boon, iSGTW