While a scientific paper remains the principal way researchers share their findings, the foundations of a project, the primary raw data, are becoming equally important, if not more so. For science to be truly productive, widespread initiatives for opening up access to data are required, along with fair policies defining how data is accessed and clear incentives linked to research impact. Europe's leading neutron source, the Institut Laue-Langevin (ILL), and other central facilities in Europe have been working together to develop a shared infrastructure to increase the availability of their data to scientists all over the world.
The ILL, nestled in the French Alps in Grenoble, France, feeds neutrons to a suite of 40 high-performance instruments, helping 2,000 visiting researchers perform over 800 experiments. Neutron scattering supports research in fields as diverse as materials science, molecular biology, and nuclear and fundamental physics.
The resulting data avalanche makes it essential that the institute implement a framework for sustainable data management and analysis as part of its service. “There is growing recognition that these data should be networked and preserved for future studies to reuse in replicating and validating scientific conclusions,” said Jean-Francois Perrin, head of IT services at the ILL.
Although the ILL monitors its published papers, of which there are more than 660 per year, the institute also handles many different types of data, both raw data and metadata, along with the protocols used to collect and store them.
The scope of scientific data management is also broad. Not only does the raw data have to be curated, but its context has to be described: how, when, and by whom a particular set of data was collected and formatted. This contextual information is known as ‘metadata’ (e.g. experimental conditions, instrument type, date and time, compression algorithms, or software code).
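To make the distinction concrete, a metadata record of this kind can be sketched as a simple key-value structure. The field names below are illustrative assumptions, not the ILL's actual schema:

```python
# A minimal sketch of an experiment metadata record describing the
# context of a raw dataset. Field names are illustrative only.
metadata = {
    "instrument": "small-angle scattering diffractometer",
    "collected_by": "J. Doe",
    "collected_on": "2012-05-14T09:30:00",
    "experimental_conditions": {"temperature_K": 293, "pressure_bar": 1.0},
    "compression": "gzip",
    "software_version": "reduce-1.2.3",
}

# Curation means, among other things, checking that the descriptive
# context is complete before a dataset enters the repository.
REQUIRED_FIELDS = {"instrument", "collected_by", "collected_on"}

def missing_fields(record):
    """Return the required descriptive fields a record lacks."""
    return REQUIRED_FIELDS - record.keys()

print(missing_fields(metadata))  # set() when the record is complete
```

A real catalogue such as ICAT enforces a much richer schema, but the principle is the same: without this descriptive layer, the raw numbers cannot be found, interpreted, or reused.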
The ILL is one of 13 neutron and photon laboratories in Europe, so a joint approach to data policy was important. This ambitious task was undertaken by the PaNdata Open Data Infrastructure project.
Although raw data has been published since the very first experiments were carried out at the ILL in 1972, the institute suffered from a lack of metadata accessibility to allow further analysis and replication. Their solution was to develop a collaborative open access repository for the community to deposit their metadata, now provided through the ICAT catalogue. “The fact that our data is openly accessible strongly contributes to collaboration between scientists. Open access serves science and the scientists by creating opportunities and a better reward of our users’ work,” said Perrin.
Deciding what constitutes ‘open’ is particularly important when developing a policy. The concept of open data was first established over fifty years ago, but a formal definition was summarized only recently by the Open Knowledge Foundation Working Group on open data in science: “a piece of data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.”
Before developing a data policy, an organization has to fully understand and audit its requirements for data availability and management. The ILL undertook an 18-month consultation with its users, from April 2010 until November 2011, before publishing its Common Data Policy, which will be applied this autumn.
International and national funding bodies are now also introducing policies to encourage a culture of data sharing. The National Science Foundation in the US requires data management plans, including provisions for access and sharing, to be submitted alongside grant applications.
As the open data movement matures, one of the main barriers to accessing research data, recently highlighted in an EC survey published in April 2012, is a lack of national or regional data policies. However, trusted experts are busy developing strategies for handling their data. In May 2012, one of the largest international scientific collaborations, the Large Hadron Collider's CMS (Compact Muon Solenoid) experiment, announced its policy to manage and share its unique data.
But how do organizations balance the need for ‘openness’ in science with confidentiality and security concerns? To safeguard researchers from being pre-empted, the ILL proposes a three-year embargo period.
“The ILL provides the beam, the instruments and experts, but it’s the user who produces the idea of the experiment and prepares the samples. After the experiment, the researcher needs time before releasing the data. This can often take a while or even necessitate more than one experiment, and three years corresponds to this necessary gestation period, and also the typical duration of a PhD,” said Professor Helmut Schober, science director at the ILL. The CMS policy has a similar embargo period.
Perrin said that for the policy to be successful, scientific publications will need to explicitly cite not only other publications but also experimental data and the teams that produce it. “Often scientists would like to access this information too, reuse the raw data, and as yet, a ‘static’ image of a graph in a traditional journal doesn't allow this,” said Perrin.
A number of metrics exist for measuring a publication's impact (H-Index, Impact Factor), but methods for making data citeable are still less well recognized. Several European projects (DataCite, OpenAIREplus, and Opportunities for Data Exchange) are helping to incentivize and assist data sharing. In just the same way that you can cite other sources of information, such as articles and books, DataCite is creating a scholarly structure for identifying and referring to data that will facilitate recognition and reward for data producers.
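To illustrate the idea, a data citation can be assembled from the same components as an article citation. The record below is fabricated, and the exact format is an assumption loosely modeled on DataCite's recommended creator–year–title–publisher–identifier style:

```python
# A sketch of a DataCite-style data citation built from its parts.
# The dataset, authors, and DOI below are fabricated for illustration.
def format_data_citation(creators, year, title, publisher, doi):
    """Render 'Creators (Year): Title. Publisher. DOI' as one string."""
    return f"{'; '.join(creators)} ({year}): {title}. {publisher}. https://doi.org/{doi}"

citation = format_data_citation(
    creators=["Doe, J.", "Roe, R."],
    year=2012,
    title="Neutron scattering raw data (example)",
    publisher="Institut Laue-Langevin",
    doi="10.0000/example.dataset",
)
print(citation)
```

The key point is the persistent identifier: because the DOI resolves to the dataset itself rather than to a static graph in a journal, the citation both rewards the data producers and lets readers reach the raw data.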
Linking peer-reviewed research publications and datasets is also important. The OpenAIREplus initiative builds on the OpenAIRE project, which provides a large-scale repository for European researchers to deposit and access articles and data. This FP7 EC-funded project extends the OpenAIRE infrastructure to cover scientific data and is developing the concept of enhanced publications (EPs), where research papers link to supplementary data.
Tim Smith, group leader of Collaboration and Information Services at CERN, has been heavily involved in the OpenAIRE and OpenAIREplus projects from the start, and said, “An EP is a compound object which groups together the paper with all associated items such as metadata, datasets, persons involved (by reference), and subsidiary articles.”
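Smith's description of an EP as a compound object can be sketched as a simple data structure. The class name, fields, and identifiers below are assumptions for illustration, not the OpenAIREplus data model:

```python
from dataclasses import dataclass, field

# A sketch of an "enhanced publication": a paper grouped with its
# metadata, datasets, people (by reference), and subsidiary articles.
# All names and identifiers here are illustrative placeholders.
@dataclass
class EnhancedPublication:
    paper_doi: str
    metadata: dict
    dataset_ids: list = field(default_factory=list)
    person_refs: list = field(default_factory=list)        # people by reference
    subsidiary_articles: list = field(default_factory=list)

ep = EnhancedPublication(
    paper_doi="10.0000/example.paper",
    metadata={"instrument": "example diffractometer", "year": 2012},
    dataset_ids=["10.0000/example.dataset.1"],
    person_refs=["orcid:0000-0000-0000-0000"],
)
```

Grouping by reference, rather than copying the datasets into the publication, is what lets each component live in its own repository while the compound object ties them together.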
On 11 June 2012, OpenAIRE hosted a workshop on EPs and on developing data policies, ‘Linking Open Access publications to data – policy development and implementation’, in conjunction with the Nordbib Conference 2012 in Copenhagen. Smith also said that a network of national open access desks is on hand to advise repositories on best practice and to advise users on where to store, and how to access, publications. “While infrastructure and repositories are a necessary base, policies are there for vision and consistency, and a network of help desks is there to get the momentum going,” Smith said.
If e-science is to offer solutions to grand societal challenges and the mysteries of the universe, it is clear that a new data dissemination model and common global policies will be needed to aid cross-disciplinary research communities.