Crowdsourcing the dark matter of biology

Image courtesy JFantasy, Wikimedia Commons.

Faced with the daunting prospect of profiling the complexities of the immune system, researchers at Harvard Medical School and Harvard Business School enlisted the help of the world’s largest community of software experts on the site TopCoder. A recent paper in Nature Biotechnology points to a cultural shift in academia, with experts engaging the collective skills of people outside their own community to overcome methodological barriers in their work.

The problem in a nutshell

Ramy Arnaout is a systems biologist who investigates the extraordinary complexity of the immune system. He describes this area as “the dark matter of human biology”, since profiling the immune system is challenging due to its highly adaptive nature and the limitations of current investigative tools. Each new immune cell, whether it’s an antibody-producing B-cell or a T-cell, contains the products of different gene segments (VDJ) which are switched, mutated and recombined, creating in each case a unique immune cell with its own specificity. This means that a small number of gene segments (<100) can lead to a huge number of different cells (~10^30), and it is this diversity that makes profiling and annotating the genetic sequencing of immune cells via high-throughput next-generation sequencing (NGS) challenging. NGS primarily involves sequencing and then mapping genetic material back to a reference genome. “Unfortunately, you can’t just take an immune cell genetic material and map it back to the genes that it came from, as what you get [immune cells] doesn’t come from a single gene but from a combination of gene segments,” says Arnaout.

In 2009, Arnaout developed a workable biologic algorithm for making sense of the diversity of data. However, he also foresaw that his algorithm ran the risk of becoming impractical as the volume of sequence data grew.

Crowdsourcing a bioinformatics problem

Step forward, Eva Guinan, professor of radiation oncology at Harvard Medical School and Karim Lakhani, of the Technology and Operations Management Unit at Harvard Business School. As part of a team at Harvard Catalyst, Guinan and Lakhani were investigating open and innovative approaches in tackling big data problems in medical research, and were looking for case studies.

The three researchers, inspired by Joy’s management principle (“No matter who you are, most of the smartest people work for someone else”), tapped into the expertise of an online community that likes to solve such problems – 462,000 algorithm enthusiasts and software developers on the commercial crowdsourcing platform TopCoder.

Gene segments to substrings

Arnaout first had to translate his domain-specific biological problem into a more generic one that a computer scientist or coder could tackle. Instead of centering the task on ‘how V, D and J gene segments mix and recombine to form an antibody gene’, he rewrote it as a problem statement with no reference to biology – a string concatenation task. The goal now was to look at a final string and predict the most likely substrings from sets A, B and C that had combined to form it. “There are some biological problems that are easier to turn into this language depending on the underlying concepts, for example evolutionary problems are easy to interpret,” says Arnaout.
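To make the generic version of the task concrete, here is a minimal sketch in Python. It is not Arnaout's algorithm or any contestant's solution: it simply scores every candidate in sets A, B and C by the longest run of characters it shares with the final string and picks the best from each set. The function names, the toy words and the scoring rule are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Toy candidate sets (hypothetical data, standing in for sets A, B and C).
SET_A = {"A1": "alphabet", "A2": "antelope"}
SET_B = {"B1": "banana", "B2": "bridge"}
SET_C = {"C1": "cascade", "C2": "compass"}

# A final string built from trimmed pieces of A1, B2 and C2,
# with a few unrelated characters inserted at the joins.
TARGET = "alphab" + "x" + "ridg" + "yz" + "ompass"


def longest_shared_run(candidate: str, target: str) -> int:
    """Length of the longest contiguous run shared by candidate and target."""
    matcher = SequenceMatcher(None, candidate, target, autojunk=False)
    match = matcher.find_longest_match(0, len(candidate), 0, len(target))
    return match.size


def best_candidate(candidates: dict, target: str) -> str:
    """Name of the candidate sharing the longest run with target (naive score)."""
    return max(candidates, key=lambda name: longest_shared_run(candidates[name], target))


def predict_sources(target: str, set_a: dict, set_b: dict, set_c: dict) -> tuple:
    """Pick the most likely contributor from each set, independently."""
    return tuple(best_candidate(s, target) for s in (set_a, set_b, set_c))


if __name__ == "__main__":
    print(predict_sources(TARGET, SET_A, SET_B, SET_C))  # ('A1', 'B2', 'C2')
```

A scorer this simple ignores the order of the pieces and tolerates no substitutions inside a match, so it is only a baseline; it says nothing about how the contest entries reached their accuracy or speed.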

Within two weeks of launching the contest, Arnaout was bombarded with 89 viable solutions from 122 individuals. Sixteen of the submissions were more accurate than the code that Arnaout had initially written. Each was scored according to both quality and time, against industry references such as BLAST (which finds regions of local similarity between sequences) and had to take less than 30 seconds on a standard computer to process 1 million unknown sequences.

The biological data bottleneck

Next-generation sequencing techniques yield enormous amounts of data, sometimes terabytes per sample. The method could become thousands or tens of thousands of times more productive over the next decade. But there is a shortage of skilled staff to both develop diverse algorithms and integrate them seamlessly into scientific workflows and big-data architectures. It has been estimated that by 2018 up to 200,000 data scientists will be needed to meet the increased demand in the US alone. Crowdsourcing could provide a viable solution for optimizing methodology. “Out of a total of 733 participants, there were no computational biologists. This was surprising, but it meant that we were able to ‘tap into’ an entirely different skills set that we wouldn’t have otherwise been able to reach without a collaborative approach,” says Arnaout.

 

Accuracy score plotted against the speed of contestants’ submissions. The plot shows the top 70 final submissions (top ten as red circles; remainder as unfilled circles), MegaBLAST (triangle) and Ramy Arnaout’s code (square). Image courtesy: Nature Biotechnology

Crowdsourcing is also extremely cost-effective compared to the alternatives. “We were able to produce a massive amount of output in a much more cost efficient way. For around $6,000 and two weeks of collective effort, we had access to the expert computer scientists at a rate of $2.50 an hour,” says Arnaout. Contestants spent on average about 22 hours (roughly three working days) developing solutions, and made on average 5.4 submissions each.
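As a rough sanity check on the quoted hourly rate, the figures in this article can be combined in a few lines of Python. This is a back-of-the-envelope sketch, not the researchers' own calculation: it assumes the "contestants" who averaged 22 hours are the 122 individuals who submitted solutions, and the variable and function names are illustrative only.

```python
# Figures taken from the article.
PRIZE_BUDGET_USD = 6_000          # "around $6,000"
AVG_HOURS_PER_CONTESTANT = 22     # "approximately 22 hours"

# Assumption: the contestants counted are the 122 individuals who submitted.
SUBMITTING_CONTESTANTS = 122


def effective_hourly_rate(budget_usd: float, contestants: int, avg_hours: float) -> float:
    """Budget divided by the total number of hours contributed."""
    total_hours = contestants * avg_hours
    return budget_usd / total_hours


TOTAL_HOURS = SUBMITTING_CONTESTANTS * AVG_HOURS_PER_CONTESTANT
RATE = effective_hourly_rate(PRIZE_BUDGET_USD, SUBMITTING_CONTESTANTS, AVG_HOURS_PER_CONTESTANT)

if __name__ == "__main__":
    print(f"{TOTAL_HOURS} hours in total, about ${RATE:.2f} per hour")
```

Under that assumption the total comes to 2,684 hours and roughly $2.24 an hour, in the same range as the $2.50 Arnaout quotes; the gap presumably reflects rounding or a different head count.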

Open and participatory science 

Karim Lakhani, an expert in the econometrics of team formation and open source software, is also examining the collaborative process to understand whether it is more efficient to be open (letting everyone see everybody else’s work in progress) or closed (everyone working on their own), and plans to publish a paper with his findings later this year. The researchers found that contestants were motivated to take part by a combination of factors (e.g. a small financial component, the joy of competition, etc.). There is also a degree of “street cred” involved: when applying for a computer science job, there is definite kudos in being able to say that you’ve won a TopCoder competition, Arnaout suggests.

This is also not the first time computational biologists have benefited from academic outsourcing. Scientific problems have been successfully re-packaged in online computer games (e.g. the protein folding puzzle, FoldIt) and also as competitions for those within academia (e.g. the CLARITY project).

The boundaries between researchers and amateur enthusiasts may be blurring, and what motivates people to take part is becoming a research field in itself. Researchers at the Open University, UK are examining how web 2.0 technologies are changing the way scientists engage with the wider public. Vickie Curtis has surveyed citizen scientists from FoldIt as part of her PhD thesis. “One of the most important reasons that people play Foldit, is the community aspect of the game, and many players relish the opportunity to collaborate with the other participants and project scientists,” explains Curtis. This raises new questions about how open, networked and participatory the scientific process can and should be.

Since publishing their work last month, the Harvard team has already been approached by a number of researchers in the fields of cancer and genomics. 
