Tuesday, March 26, 2013

On the Storage Needed to Capture Human DNA

Ever since Watson and Crick solved the puzzle of DNA's structure in 1953, and shared the 1962 Nobel Prize in Physiology or Medicine with Wilkins, there has been interest in storing and retrieving DNA information.

However, it was not until the Human Genome Project concluded in 2003 that the entire human genome (about 3 billion nucleotides) was considered fully decoded [1].

Now, what would it take to keep the entire human genome in persistent storage, such as the cloud? This kind of data seems well suited to inexpensive cloud storage such as Amazon Glacier, at about $0.01/GB/month. If you simply stored each nucleotide in one byte, the genome would take about 3 GB, and you'd pay $0.03/month. Yes, only 3 cents a month!

Or, of course, you can buy a 1 TB hard disk for less than $100 and be done with it: you'd use only about 0.3% of the drive.
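The arithmetic behind those two figures can be checked with a short Python sketch. The constant and function names are made up for illustration, and it assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) together with the price quoted above.

```python
# Back-of-the-envelope check of the storage figures in the text.
# Assumptions: decimal units and the $0.01/GB/month price quoted above.
GENOME_NUCLEOTIDES = 3_000_000_000   # approximate size of the human genome
BYTES_PER_GB = 10**9                 # decimal gigabyte (assumption)
PRICE_PER_GB_MONTH = 0.01            # USD per GB per month, as quoted in the text
DISK_BYTES = 10**12                  # a 1 TB drive in decimal bytes (assumption)


def monthly_cost(num_bytes, price_per_gb=PRICE_PER_GB_MONTH):
    """Monthly storage cost in USD for num_bytes of data."""
    return num_bytes / BYTES_PER_GB * price_per_gb


def disk_fraction(num_bytes, disk_bytes=DISK_BYTES):
    """Fraction of the drive occupied by num_bytes of data."""
    return num_bytes / disk_bytes


# One byte per nucleotide, as in the naive scheme described above.
naive_size = GENOME_NUCLEOTIDES
print(f"size:  {naive_size / BYTES_PER_GB:.1f} GB")
print(f"cost:  ${monthly_cost(naive_size):.2f} per month")
print(f"drive: {disk_fraction(naive_size):.1%} of 1 TB")
```

With binary units (GiB, TiB) the numbers shift slightly, but the conclusion is the same: a few cents a month, or a tiny corner of a consumer hard disk.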

Now, since the 3 billion nucleotides are drawn from an alphabet of just 4 symbols (A, C, G, and T), each one can be encoded in only 2 bits, reducing the storage needed to 1/4 of 3 billion bytes, or about 750 MB. For example:

Nucleotide   Encoding
A(denine)    00
C(ytosine)   01
G(uanine)    10
T(hymine)    11
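As a minimal sketch of how this 2-bit encoding could be applied, the Python below packs four nucleotides into each byte and unpacks them again. The function names `pack` and `unpack` are hypothetical, and the padding rule (a short final group is left-aligned and filled with zero bits, so the true length has to be kept separately) is an assumption of this sketch, not something the table prescribes.

```python
# 2-bit codes from the table above.
ENCODING = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODING = {code: base for base, code in ENCODING.items()}


def pack(sequence: str) -> bytes:
    """Pack a string of A/C/G/T into bytes, four nucleotides per byte."""
    out = bytearray()
    for i in range(0, len(sequence), 4):
        chunk = sequence[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | ENCODING[base]
        # Assumption: left-align a short final chunk, padding with zero bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` nucleotides from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):   # most significant pair first
            bases.append(DECODING[(byte >> shift) & 0b11])
    return "".join(bases[:length])


packed = pack("GATTACA")
print(len(packed), "bytes for 7 nucleotides")
print(unpack(packed, 7))
```

At four nucleotides per byte, 3 billion nucleotides come to 750,000,000 bytes, which is the 750 MB figure above. Real sequence data also contains ambiguous bases (often written N), which this simple scheme cannot represent.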

[1] In April 2003, it was believed that 99% of the human genome was indeed captured to an accuracy of 99.99%.
