
How can computational science surpass the software error plateau?

Diagram of the Four-Color Theorem which is a map colored with four colors.

The four-color theorem map: In mathematics, the four-color theorem says that no more than four colors are required to color the regions of any map (a plane separated into contiguous regions) so that no two adjacent regions have the same color. The 1976 proof of the theorem by Kenneth Appel and Wolfgang Haken relied on intensive computer analysis, which caused considerable controversy within the mathematical community about whether the proof was fully valid. Image courtesy Wikimedia Commons.

Computation plays a crucial role in generating scientific results, but it also adds complexity. Reproducing the code and data that generate results for peer review has been a problem since the 1970s and 1980s. Earlier this year, iSGTW wrote about the slow progress in quantifying the effects of software defects on computational results, and reported that if scientists can't reproduce a result from its source code and data, then that result should be treated with caution or even discarded.

In October 2012, a workshop about maintainable software practices in e-science highlighted that unchecked errors in code have led to retractions of major research papers. For example, in December 2006, Geoffrey Chang from the Department of Molecular Biology at the Scripps Research Institute, California, US, was horrified to find that a hand-written program had flipped two columns of data, inverting an electron-density map. As a result, a number of papers had to be retracted from the journal Science. Now, RunMyCode.org aims to enable easy replication and sharing of scientific code and data. A research paper about the project is published on the Social Science Research Network (SSRN).

Don’t judge the code, share it

In the process of scientific publishing, a researcher may write a little or a lot of scientific code. Typically, it’s left to languish on a hard disk or computer system. This is where RunMyCode steps in. The site provides a cloud-based platform for scientists to openly share their code and data by creating a companion website that’s associated with a scientific publication.

“People don’t often think about the importance of code in the [scientific publishing] discussion. I think reproducibility can bring both code and data into that discussion,” says Victoria Stodden, a co-founder of RunMyCode. Stodden is an assistant professor of statistics at Columbia University in New York City, US, and a member of the National Science Foundation’s Advisory Committee for Cyberinfrastructure.

RunMyCode was also co-founded by Christophe Perignon of the HEC international business school in Paris, France, and Christophe Hurlin of the University of Orleans and Université Paris IX Dauphine, France. They both come from an economics background, but they want their tool to help all scientific disciplines.

Other than running a researcher’s code, the project doesn’t review or judge the code in any way.

It’s all about replicating

Replicating other people’s computations will enable science to get its dissemination house in order, says Stodden. The result: better quality science.

If a researcher wants to run their dataset and scripts on RunMyCode’s platform, all they have to do is click the big orange button that says ‘create companion website’ and create an account. The platform then asks for information about their published paper, and the researcher is required to upload their code and data.

The system verifies whether the results of the paper, such as tables and figures, can be regenerated. Finally, the researcher clicks a green button, which runs their code and displays the research paper’s results onscreen. This verification of scientific results can be executed directly on RunMyCode’s servers or on a researcher’s local machine.
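As a rough illustration of what such a verification step might involve, the Python sketch below compares values regenerated by a researcher’s code against the values printed in a paper’s table. It is not RunMyCode’s actual implementation: the function name find_mismatches, the table entries, and the tolerance are all hypothetical.

```python
import math


def find_mismatches(published, reproduced, rel_tol=1e-6):
    """Return the keys whose reproduced value is missing or differs
    from the published value by more than the relative tolerance."""
    mismatches = []
    for key, expected in published.items():
        actual = reproduced.get(key)
        if actual is None or not math.isclose(expected, actual, rel_tol=rel_tol):
            mismatches.append(key)
    return mismatches


# Hypothetical values standing in for one table of a published paper.
published_table = {"mean_return": 0.0412, "std_error": 0.0087}
# Hypothetical values produced by re-running the uploaded code.
reproduced_table = {"mean_return": 0.0412, "std_error": 0.0091}

problems = find_mismatches(published_table, reproduced_table)
if problems:
    print("Could not reproduce:", ", ".join(problems))
else:
    print("All published values reproduced.")
```

A tolerance is used rather than exact equality because floating-point results can differ slightly between machines and software versions.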

The platform can support code that runs on Linux, in MATLAB, and in other proprietary software formats. If a researcher resubmits or changes their data, RunMyCode creates an executable file which is managed by an in-house cloud service.

“We’re completely unique. I don’t know of anyone doing exactly what we’re doing but that’s not actually the correct framing. There are multiple solutions, multiple tools, and [computational] problems are granular. What works for one group might not be the right approach for another group,” says Stodden.

The service has been live since March 2012 and has the potential to address a number of problems, such as retractions of research papers.

Dealing with retractions

Screenshot of a researcher's RunMyCode website companion page.

Your very own companion page: All code on RunMyCode is written by researchers. Depending on the research paper, it can be a few short scripts or quite complex. A companion page usually contains information about the author, institution, title of the paper, the journal it was published in, links to the abstract and paper, and text describing the code itself. Image courtesy RunMyCode.

“Retractions are coming up fast and furious,” says Stodden. In August this year, a paper in the journal Hypertension had to be retracted because of a coding error that led to the doubling of a sample size and significantly different estimates.
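The article does not describe how the Hypertension error arose. Purely as an illustration of the kind of inexpensive check that can catch an inflated sample size, the Python sketch below refuses to proceed if participant records are duplicated or if the count differs from the number the study expects. The function name check_sample_size, the field name participant_id, and the data are hypothetical.

```python
def check_sample_size(records, expected_n, id_field="participant_id"):
    """Return the sample size, or raise ValueError if any participant
    appears more than once or the count differs from expected_n."""
    ids = [record[id_field] for record in records]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate participants found; sample size is inflated")
    if len(ids) != expected_n:
        raise ValueError(f"expected {expected_n} participants, found {len(ids)}")
    return len(ids)


# Hypothetical records: the second list repeats every participant,
# which silently doubles the apparent sample size.
clean = [{"participant_id": 1}, {"participant_id": 2}]
doubled = clean + clean

print(check_sample_size(clean, expected_n=2))
try:
    check_sample_size(doubled, expected_n=2)
except ValueError as error:
    print("Check failed:", error)
```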

“I don’t think reproducibility is the only thing to bring things to light. There’s certainly fraud that happens out there but I don’t see RunMyCode as geared towards fraud per se. With retractions, we’re definitely interested in research integrity but we see ourselves as a very positive force to enable scientists to do the research rather than a negative force trying to poke holes in the system or find people that are bad actors,” says Stodden.

Some researchers have been reluctant to share their code because they want to build companies around it and seek patents. “Sharing code does not necessarily prevent you from patenting or starting a company. People can license their software in a certain way or create permissions,” says Stodden.

Journals may be interested in whether or not their results are replicable, verifiable, or reliable, yet they’re unwilling to ask their reviewers to take on the burden of rewriting or reviewing the code. Looking at a research paper’s code is the only way to see what’s going on.

Are we talking about data, code, or both?

“It’s not an exaggeration to say that the genomics and bioinformatics data has been extraordinarily influential in the life sciences and modern research,” says Stodden. The open-data discussion was catalyzed by the race to decode the human genome in the late 1990s. This was a genome war: on one side were the scientists who wanted to put the data in the public domain, and on the other side were the likes of Craig Venter, a US biologist and entrepreneur who wanted to restrict access and license people to use the data, says Stodden. That discussion about open data has continued to this day.

For scientific knowledge to move forward you can’t talk about data without talking about code – they are interconnected. Stodden says for empirical science, methods are often in the code, but to replicate empirically-driven results you need access to the data.

It’s fundamentally scientific

In the end, the ‘code-sharing’ topic is of great interest among all research disciplines. “We aren’t trying to tell anyone what the answers are, but are giving them tools to understand why – it’s fundamentally scientific,” says Stodden. She adds that the different open-software communities that RunMyCode has talked to have shown a lot of interest.

“In a sense, what we’re doing is the genesis of the open-source software community. There are plenty of differences, but I think the philosophical connection makes them sympathetic to what we’re doing,” says Stodden.

But Stodden didn’t just start working on RunMyCode out of the blue – it came from an inspiration.

“I’m partly to blame,” says David Donoho of the statistics department at Stanford University, who isn’t involved with RunMyCode. Donoho was Stodden’s supervisor at Stanford.

“Stodden has been working energetically for years to make computational science more transparent and reproducible. I [required] her to make her PhD thesis work freely available over the internet and reproducible. Since that time she [has] worked on many issues arising from the lack of reproducibility in computational science. I suppose she found the RunMyCode approach a good match to her philosophy.”

Donoho says that RunMyCode will make a real impact by conveniently enabling others to reproduce computations without needing the exact software and computer. “You can even use it from a phone or tablet. It might become for computations what arXiv.org has become for articles,” says Donoho.

Currently, the RunMyCode platform is suited to smaller data sets and scripts. The team is working on handling larger code bases and complicated data sets so they can support any published work. 
