Transferring DNA in digital

Image courtesy Miriam Boon.

I think I can speak for most biologists when I say I never thought I would be worrying about file transfer. Compute power, yes. Storage space, maybe. But file transfer? Never.

Unlike some other scientific disciplines, biology is not traditionally a ‘big data’ science. Generally, biologists produce data on a scale suited to e-mail attachments. However, seemingly overnight, biology has been propelled into the ranks of the big data sciences. Now a biologist can easily find herself confronted with terabytes of data. Why the change? The answer lies in the recent quantum leaps in DNA sequencing technology.

DNA sequencing itself is not new. Molecular biologists have been sequencing DNA for decades using a technique developed in the late 1970s. But in 2005, a series of discoveries culminated in the advent of the next generation sequencing technologies. As a result, sequencing costs are plummeting and efficiency is soaring. The first human genome, completed in 2003, took approximately 13 years and $300 million in materials costs. Today, a human genome can be sequenced in a week for $1000 in materials costs.

Did you know?

'Next generation sequencing' is actually a collection of several distinct sequencing techniques, but one common characteristic is that, unlike earlier technologies, they sequence DNA on a massively parallel scale.

Now that DNA sequencing has become relatively fast and cheap, huge data sets are available to a wide spectrum of biologists, from the microbiologist who wants to identify the genomes of all microbes hosted by the human body (did you know that 90% of the genomes in your body are microbial, not human?) to the clinician who wants to know how the genomes differ between her diabetic and healthy patients. In fact, DNA sequencing has become so cheap and so fast that some scientists predict that in less than 10 years the number of sequenced genomes will increase from the current count of a few thousand to over 200 million.

Which brings me back to file transfer. By now, most biologists have at least heard about, if not experienced, the compute power and storage crisis precipitated by next generation sequencing data. But I suspect few have considered the additional problem of data movement. Yet thousands of next generation sequencing instruments around the globe are producing massive amounts of data that must routinely be moved off the instruments and onto compute clusters and storage devices. My husband, a molecular geneticist, summed up a typical attitude when he said to me, “What’s the problem? Just move the files from here to there.”

The problem is that, although scientists running sequencing centers are highly trained experimentalists, they generally have very little IT experience or support. To many, command line interfaces, checksums, firewalls and the like are all awkward, if not totally foreign. In many cases, the usual methods of scp, sftp, or rsync are so inefficient, unreliable, or complicated that sequencing centers resort to shipping hard drives.
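For readers who have not met a checksum before, here is a minimal sketch of the idea in Python: compute a fingerprint of the file before and after it is moved, and compare the two. This is only an illustration, not how any particular sequencing center or transfer tool works. The function names and the file name `reads.fastq` are hypothetical, and a local copy stands in for a real network transfer.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks
    so that very large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, destination):
    """Copy a file (a stand-in for a real transfer) and report whether
    the copy has the same checksum as the original."""
    shutil.copyfile(source, destination)
    return sha256_of(source) == sha256_of(destination)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        # Hypothetical file name and contents, for illustration only.
        source = Path(workdir) / "reads.fastq"
        source.write_bytes(b"@read1\nACGT\n+\nIIII\n")
        destination = Path(workdir) / "reads_copy.fastq"
        print(copy_and_verify(source, destination))
```

Even this small example hints at the burden: someone has to remember to run the check for every file, on both ends, and decide what to do when the fingerprints disagree.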

So, any file transfer solution universally adopted by sequencing centers will need to be as easy and reliable as dragging and dropping files onto a hard drive, leaving the hard drive at the FedEx box, and walking away feeling confident that you are done. If it’s not that simple, it won’t be adopted.

We are implementing Globus Online at the University of Chicago Sequencing Center, and based on what I’ve seen so far, it meets the three key qualifications of secure, high-speed, and easy to use, with a drag and drop interface. I won’t be surprised to see Globus Online become part of the computing infrastructure that supports the genomic age.

Although next generation sequencing has been around for almost a decade now (in fact, ‘3rd generation’ technologies are now coming onto the market), I’m still awestruck by the impact that one leap in technology has had on an entire scientific field and beyond. And I get a shiver of excitement when I think about the growing mountain of data that is just waiting for us to explore!

A version of this article originally appeared in the Globus Online blog.
