The scientific method has been the most successful contributor to systematic progress in the history of human endeavour. One of the key elements of the method is that if the result cannot be reproduced, it is discarded. Models are then developed consistent with non-discarded work to see if they can make further predictions, which can be tested.
This has not been the case for scientific computation – which has been taking on an increasingly important role in science over the last few decades.
The main author of this article, Les Hatton, co-authored an opinion piece in February in the British journal Nature, calling for consistent regulation of the release of source programs by researchers:
“Although it is now accepted that data should be made available on request, the current regulations regarding the availability of software are inconsistent. We argue that, with some exceptions, anything less than the release of source programs is intolerable for results that depend on computation. The vagaries of hardware, software and natural language will always ensure that exact reproducibility remains uncertain, but withholding code increases the chances that efforts to reproduce results will fail,” they wrote.
One part of the puzzle is measuring approximate defect densities in software. Defect density can be measured, for example, as the total number of defects ever found divided by the approximate number of lines of code involved. Typically, this measurement falls somewhere in the range 0.1–10 defects per thousand lines of code.
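As a back-of-the-envelope illustration of that measurement (the figures below are hypothetical, not taken from any real project), the calculation is simply:

```python
# Hypothetical project figures, for illustration only.
defects_found = 27          # total defects ever found
lines_of_code = 180_000     # approximate size of the code base

# Defect density: defects per thousand lines of code (KLOC).
density = defects_found / (lines_of_code / 1000)
print(f"{density:.2f} defects per KLOC")   # 27 / 180 = 0.15
```

A project like this, at 0.15 defects per KLOC, would sit near the better end of the typical range.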
Even though researchers have made some progress in measuring the density of defects, there has been little progress in quantifying the effects of those defects on the computational results. A really terrific piece of code, with 0.1 defects per thousand lines of code, could have a really serious defect, while a fairly awful piece of code with 10 defects per thousand lines of code could turn out to be quite accurate.
I (Hatton) have worked for 40 years in meteorology, seismology, and computing, and most of the software I’ve used has been corrupted to some extent by such defects – no matter how earnestly the programmers performed their feats of testing. The defects, when they eventually surface, always seem to come as a big surprise.
The defects themselves arise from many causes, including: a requirement might not be understood correctly; the physics could be wrong; there could be a simple typographical error in the code, such as a + instead of a - in a formula; the programmer may rely on a subtle feature of a programming language which is not defined properly, such as uninitialized variables; there may be numerical instabilities such as over-flow, under-flow or rounding errors; or basic logic errors in the code. The list is very large. All are essentially human in one form or another but are exacerbated by the complexity of programming languages, the complexity of algorithms, and the sheer size of the computations.
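To make the typographical category concrete, here is a hypothetical sketch of how a one-character sign error changes a result; the projectile formula and the values are illustrative, not drawn from any real code base:

```python
def height_correct(v0, t, g=9.81):
    # Height of a projectile: v0*t - (1/2)*g*t^2
    return v0 * t - 0.5 * g * t * t

def height_buggy(v0, t, g=9.81):
    # The same formula with a one-character typo: '+' instead of '-'.
    return v0 * t + 0.5 * g * t * t

# At v0 = 10 m/s and t = 1 s, the single wrong character
# nearly triples the answer:
print(height_correct(10.0, 1.0))   # 10 - 4.905 = 5.095
print(height_buggy(10.0, 1.0))     # 10 + 4.905 = 14.905
```

Both versions compile and run without complaint, which is precisely what makes such defects hard to catch.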
As an example, here is a reconstruction of a vertical slice through a North Sea gas field from seismic data. It looks very convincing to a geologist, and producing it required large amounts of computation on high-performance computers using software that was reliable and well tested by a highly responsible company. It was considered state of the art.
However, we then performed a reproducibility experiment, which took three years, in which eight other companies processed the same input data using the same algorithms in the same programming language, but coded independently. The result is the collage shown here.
Individually, they all look very convincing, but they are significantly different to a geologist, even though they are supposed to be the same. It turned out that these differences are entirely due to latent software defects that had lain hidden, for years in some cases, before being flushed out by this reproducibility experiment. In this case, the latent defects included the well-known ‘off-by-one’ array indexing problem, uninitialized variables, sign errors in wave-propagation algorithms, simple logic problems whereby unexpected program paths were followed, and incorrectly calculated geometries.
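An off-by-one indexing error of the kind mentioned above can be sketched in a few lines; the moving-average function here is hypothetical, chosen only to show how the bug silently truncates a result rather than crashing:

```python
def moving_average_buggy(xs, window):
    # Off-by-one: the range stops one position short,
    # silently dropping the final window.
    return [sum(xs[i:i + window]) / window
            for i in range(len(xs) - window)]

def moving_average_fixed(xs, window):
    # Correct: len(xs) - window + 1 windows fit in the data.
    return [sum(xs[i:i + window]) / window
            for i in range(len(xs) - window + 1)]

data = [1.0, 2.0, 3.0, 4.0]
print(moving_average_buggy(data, 2))   # [1.5, 2.5] -- last value missing
print(moving_average_fixed(data, 2))   # [1.5, 2.5, 3.5]
```

No exception is raised and every number printed is individually plausible, which is why such defects can survive years of production use.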
The cumulative effect of these defects meant that we could only reproduce the results to one or two significant figures rather than the six inherent in 32-bit floating point computations. There was no prior warning to the programmers or the end-users that this could be the case. Defects of this size greatly undermine the accuracy of this process, which needs at least three significant figures for these data, and can easily compromise the placing of an extremely expensive drilling rig. Without this comparison, the defects responsible may never have been unearthed – they had already evaded comprehensive test suites and years of production use.
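The six-figure ceiling comes from single-precision arithmetic itself, which a short self-contained sketch can demonstrate (the `to_f32` helper is purely illustrative, rounding a value through IEEE-754 single precision and back):

```python
import struct

def to_f32(x):
    # Round a Python float (double precision) to the nearest
    # IEEE-754 single-precision (32-bit) value and back.
    return struct.unpack('f', struct.pack('f', x))[0]

# Single precision carries only about 7 significant decimal digits:
print(to_f32(0.1))                     # 0.10000000149011612

# At a magnitude of 1e8, adjacent float32 values are 8 apart,
# so adding 1.0 is lost entirely:
print(to_f32(1.0e8 + 1.0) == 1.0e8)   # True
```

In a long computation, millions of such sub-resolution roundings accumulate, which is how six digits of nominal precision can shrink to one or two once defects are added on top.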
The methods used to develop the software haven't really changed much since the experiment was done in the 1990s. What has changed, however, is the volume of software used in science and the volume of data processed. Where we once had megaflops and megabytes, we now have petaflops and petabytes.
In the past, one approach to this problem was to develop a great code, and then not allow any changes to be made. This approach was taken by NASA in the 1970s, when scientists there developed NASA's most defect-free code for the Space Shuttle program.
“Software can never be considered error-free; the problem is to determine when it's reliable enough to fly with,” said Hugh Blair-Smith, who was part of the team that worked on the Space Shuttle software as part of MIT's Instrumentation Lab, and author of Journey to the Moon: The History of the Apollo Guidance Computer.
The Shuttle code had an estimated defect rate of 0.11 per 1,000 lines of code. But this solution was expensive: at the time, NASA paid IBM programmers a reputed $500 million to debug 500,000 lines of code.
“Most software at work today is developed by methods very different from what we did then. Instead of having every line of code in the machine known and controlled by a small group of people all working together, every small module of modern software rests on APIs to a dozen or so layers of infrastructure code modules created by hundreds of organizations employing myriads of people. While the integrators of these pyramids benefit from very detailed specifications for each of the bricks used at their level, and create similarly detailed specifications of how their pyramids behave for the benefit of the next layer up, the opportunities for obscure problems are many orders of magnitude greater than anything we saw then ... as anybody looking at a frozen screen knows!”
This approach is not possible at the ATLAS experiment, which is one of four particle detectors running on the Large Hadron Collider at CERN. It has about five million lines of code. And the software is constantly evolving, with improvements, clean-up, and bug fixing.
“Last year, around 300 people worked on our code. This could be a student changing one line or an expert updating thousands of lines,” said David Rousseau, a former coordinator of offline software for data reconstruction, analysis, and simulation.
They have tried organized code review in the past, said Rousseau, but with little success due to the small number of true software experts. There were also some false starts trying to run the software on different platforms (for example, Linux versus Mac operating systems), which can reveal different defects in the code.
Instead, ATLAS uses a kind of semi-open source software; the code can be accessed by everybody in the ATLAS collaboration, which numbers some 3,000 people. They have user tutorials to help researchers without extensive coding experience. Lastly, they have a series of consistency checks.
“Our system runs this automatic comparison every night and morning to catch any bugs before they’re introduced into the work chain,” said Rousseau. Developers are asked to announce when their updates might result in a change to the scientific results. The next day, the team checks the results against the developer's claim to see whether the change is real.
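ATLAS's real validation chain is of course far more elaborate, but the core idea of such a consistency check – comparing tonight's results against a stored reference run within a tolerance – can be sketched as follows; all names and numbers here are hypothetical:

```python
import math

def consistency_check(reference, current, rel_tol=1e-6):
    """Return the names of quantities that drifted beyond tolerance.

    An empty list means the update left the results unchanged.
    (Illustrative sketch only -- not ATLAS's actual system.)
    """
    changed = []
    for name, ref_value in reference.items():
        if not math.isclose(current[name], ref_value, rel_tol=rel_tol):
            changed.append(name)
    return changed

# Hypothetical reference run and tonight's run:
reference = {"track_count": 1532.0, "mean_pt_gev": 24.173}
current   = {"track_count": 1532.0, "mean_pt_gev": 24.901}

print(consistency_check(reference, current))   # ['mean_pt_gev']
```

Any quantity flagged this way can then be matched against the changes developers announced the day before.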
The main ATLAS software can be split into 10 different areas for specific physics research, each overseen by a few experts. But these different software areas can sometimes interact with each other in unexpected ways. “After user updates of individual software packages and before any major release, we get experts to manually compare results, line by line, for any potential side effects across all our code,” said Rousseau.
Plus, of course, if the ATLAS experiment claims a discovery – such as the Higgs boson – it must be confirmed by another experiment, CMS, which also takes data from proton collisions in the LHC.
So, thus far, the best scientists and researchers are simply aware of the problem, and are vigilant about checking results. But every domain of science has problems specific to its own code, and to its own ways of coding and debugging. Computing science has yet to reveal any single method to avoid, or even quantify, defects.
In the meantime, all scientists need to return to the method that has made science so successful in the first place: reproducibility. If scientists can't reproduce a result or if the source code and data are not available to test reproducibility, then that result should be treated with caution. And, if we are to be really brave, it should be discarded.