There are about 3 billion nucleotides in the 23 chromosomes of human DNA. Each nucleotide carries one of four possible bases (G, C, T, or A), so each nucleotide contains 2 bits of information, and the total (uncompressed) data in human DNA is about 715 Megabytes. Only about 3% of all human DNA actually codes for proteins, and the rest is ignored and generally referred to as “junk DNA.”
So the total genetic information necessary to fully specify a human is about 21 Megabytes (3% of 715).
That’s it! And that’s before compression. I definitely have tarballs of source code that are bigger than that. You could fit your whole family and all of your friends on your iPod. (Update: Jamie pointed out that this doesn’t include the DNA to specify all of the bacteria in our digestive tracts, without which we couldn’t survive; more details in the comments.) (Update: I’ve recently made friends with some biologists who tell me that the concept of “junk DNA” is now widely disputed, and that much of the non-protein-coding DNA is control data. Also, just because we don’t know what it does doesn’t mean that it’s junk. In any case, 715 MB is still a relatively small amount of data, a little more than one TV show downloaded from iTunes.)
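Here is a quick sanity check of the arithmetic above, as a minimal Python sketch. The constants are the round figures from this post (3 billion nucleotides, 2 bits each, 3% protein-coding); the names are made up for illustration, and a Megabyte is assumed to mean 2^20 bytes, which is the convention that gives 715. On those assumptions the protein-coding share comes out at a little over 20 Megabytes.

```python
# Round figures taken from the post; everything else is illustrative.
NUCLEOTIDES = 3_000_000_000   # nucleotides in the human genome
BITS_PER_NUCLEOTIDE = 2       # four possible bases -> log2(4) = 2 bits
CODING_FRACTION = 0.03        # share of DNA that codes for proteins
MIB = 1024 * 1024             # assumption: "Megabyte" means 2**20 bytes


def genome_bytes(nucleotides=NUCLEOTIDES, bits=BITS_PER_NUCLEOTIDE):
    """Uncompressed size in bytes at a fixed number of bits per nucleotide."""
    return nucleotides * bits // 8


total = genome_bytes()
coding = total * CODING_FRACTION

print(f"whole genome:   {total / MIB:.0f} MiB")   # 715 MiB
print(f"protein-coding: {coding / MIB:.1f} MiB")  # 21.5 MiB
```

With decimal megabytes (10^6 bytes) the same numbers come out as 750 MB and 22.5 MB, so the exact figure depends on which convention you pick.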
Most non-junk DNA is identical across the human population. A protein-coding nucleotide which varies in more than 1% of the population is called a SNP, or Single Nucleotide Polymorphism. Of the 90 million protein-coding bases in our chromosomes, there are maybe 3 million SNPs coding for differences like eye color and sickle-cell anaemia.
There are now a few companies offering low-cost partial sequencing of your DNA by mail. Mostly these companies act as front-ends to a couple labs (Illumina and Affymetrix) that use chip-based sequencing machines to sample between 500k and a million nucleotide variations (SNPs) from your chromosomes. You FedEx them a test tube full of your saliva, they send it off to a lab to get your cells cultured and your DNA sequenced, and then they put your genetic information online for you to view. Cool, right?
The best-known of the personal genomics companies is 23andMe, but there’s also DecodeMe and Navigenics. They charge about $1000 to decode 500k bases, which is about 120 kilobytes of genetic information. That’s a cost of about 0.2 cents per base or 0.8 cents per byte. That is a lot cheaper than it used to be, and the cost of decoding a nucleotide is dropping exponentially on curves reminiscent of Moore’s Law.
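The per-base and per-byte costs follow directly from the price and the number of bases read. A small sketch of that calculation, assuming the round numbers quoted above ($1000, 500k bases, 2 bits per base); the function names are hypothetical:

```python
PRICE_USD = 1000        # quoted price of the service
BASES_READ = 500_000    # bases (SNPs) sampled
BITS_PER_BASE = 2       # same 2-bit encoding as for the whole genome


def cents_per_base(price_usd, bases):
    """Price in US cents for each base read."""
    return price_usd * 100 / bases


def cents_per_byte(price_usd, bases, bits_per_base=BITS_PER_BASE):
    """Price in US cents for each byte of genetic data."""
    data_bytes = bases * bits_per_base / 8
    return price_usd * 100 / data_bytes


data_kib = BASES_READ * BITS_PER_BASE / 8 / 1024

print(f"data size: {data_kib:.0f} KiB")                                   # 122 KiB
print(f"per base:  {cents_per_base(PRICE_USD, BASES_READ):.1f} cents")    # 0.2 cents
print(f"per byte:  {cents_per_byte(PRICE_USD, BASES_READ):.1f} cents")    # 0.8 cents
```

One byte holds four bases at 2 bits each, which is why the per-byte cost is four times the per-base cost.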
The various personal genomics companies don’t let you download all your raw SNP data; they map the base pairs to a handful of genes and in the end you only get a few bytes of actual data. They also look at your mitochondrial DNA (which is passed to you from your mother directly in her egg cell that becomes you) and some little bits on the Y chromosome that don’t change between individuals to determine your likely ancestry. At least, that’s what I’ve gathered from their web sites.
But I’ll let you know soon! My spit kit arrived from 23andMe yesterday and they should have my DNA online for me to view in about 6 weeks.

When you order the kit, you have to read through some pretty interesting disclaimers:
You give permission to 23andMe, its contractors, and assignees to perform genotyping services on the DNA extracted from your saliva sample and to disclose the results of analyses performed on your DNA to you and others you specifically authorize. You are guaranteeing that the sample you provide is your saliva; if you are completing this consent form on behalf of a person for whom you have legal authorization, you are confirming that the sample provided will be the sample of that person. If you are a customer outside the U.S., by providing your sample, you confirm that this act is not subject to any export ban or restriction in the country in which you reside. You are warranting that you are not an insurance company or an employer attempting to obtain information about an insured person or an employee. You are aware that some of the information you receive may provoke strong emotion.
Personally I think having more information about myself can only be a good thing. Because you can act on it. A gene coding for a prostate cancer predisposition isn’t a death sentence — it’s a call to action. Eat better, get exercise, get checked every year after you’re 40. That sort of thing.

The New York Times has an article about these personal DNA services.
Posted on 9 February 2008
-
Only about 3% of all human DNA actually codes for proteins, and the rest is ignored and generally referred to as “junk DNA.”
Besides coding DNA and junk DNA, there is also regulatory DNA, which controls things like when coding DNA is active.
I can hardly wait until you can get your entire genome sequenced for a few thousand dollars.
PS Is there a way to make the submission not require JS? I use NoScript and ended up retyping my comment after granting temporary permissions.
-
I think you underestimate the number of bits necessary to reconstruct a human, because you’re ignoring the thousands of other distinct genomes that are needed in order to bootstrap a functioning body, e.g., the many thousands of species of intestinal bacteria without whom we completely fail to function.
-
May I recommend looking at taking out a copyright on your genome, just in case you have some important DNA like the Highlander gene and don’t want a company discovering it and claiming copyright on it, or want to try to stop someone cloning you.
http://www.creativetime.org/programs/archive/2000/DNAidBillboard/dnaid/copyright-instructions.html
-
Interesting. If you have another 1000 bucks to lose, you can PayPal me and I’ll do you a nice hand-drawn sketch from your picture in return.
Anyway, this is a cool article and biometrics are only at their beginnings. Soon the time will come when we live like in “Demolition Man”.
Actually, there is still a growing database of iris pictures, and the iris is said to be unique to each individual, just like fingerprints.
-
Knome offers full genome sequencing now, but it’s $350k. The SNP thing (23andMe) is crazy; they don’t provide much info. The buzz around the biology community (of which I am a part) is that someone is going to offer sequencing of all of your genes soon, leaving out the ‘junk’. Hopefully it will cost much less than Knome.
-
Did you find out anything interesting from 23andme about your genome? I recently interviewed at a computational genomics startup, and this is blowing me away. Right now, it would cost around $100k to sequence your whole genome, but that is getting 10x cheaper every 4 years (or possibly faster), which means that in 8 years it should cost under $1k.
When my son was 12 weeks from conception, a doctor asked us if we wanted to check whether he had Down syndrome (or possibly other things) so we would still have the option to abort. I imagine that in 8 years people will start aborting left-handed or green-eyed fetuses. It will certainly raise a lot of ethical issues. God help us if they find a “gay gene”.
-
Tool Interface Standards (TIS) Portable Formats Specification, Version 1.1
ELF: Executable and Linkable Format
A symbol’s type provides a general classification for the associated entity.
Figure 1-18: Symbol Types, ELF32_ST_TYPE

Name          Value
STT_NOTYPE    0
STT_OBJECT    1
STT_FUNC      2
STT_SECTION   3
STT_FILE      4
STT_LOPROC    13
STT_HIPROC    15

STT_NOTYPE
The symbol’s type is not specified.
STT_OBJECT
The symbol is associated with a data object, such as a variable, an array, etc.
STT_FUNC
The symbol is associated with a function or other executable code.
STT_SECTION
The symbol is associated with a section. Symbol table entries of this type exist primarily for relocation and normally have STB_LOCAL binding.
STT_FILE
Conventionally, the symbol’s name gives the name of the source file associated with the object file. A file symbol has STB_LOCAL binding, its section index is SHN_ABS, and it precedes the other STB_LOCAL symbols for the file, if it is present.
STT_LOPROC through STT_HIPROC
Values in this inclusive range are reserved for processor-specific semantics.
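As a rough illustration of the symbol types listed above, here is a minimal Python sketch that decodes the type from a symbol table entry’s st_info byte. It assumes the packing defined elsewhere in the ELF specification (type in the low four bits, binding in the high four bits); that definition is not part of this excerpt, and the helper names below are made up for the example.

```python
# Symbol type values from Figure 1-18.
STT_NAMES = {
    0: "STT_NOTYPE",
    1: "STT_OBJECT",
    2: "STT_FUNC",
    3: "STT_SECTION",
    4: "STT_FILE",
}
STT_LOPROC, STT_HIPROC = 13, 15  # inclusive processor-specific range


def elf32_st_type(st_info):
    """Mirror of the ELF32_ST_TYPE macro: low four bits of st_info."""
    return st_info & 0xF


def elf32_st_bind(st_info):
    """Mirror of the ELF32_ST_BIND macro: high four bits of st_info."""
    return st_info >> 4


def describe_symbol_type(st_info):
    """Hypothetical helper: map an st_info byte to a readable type name."""
    sym_type = elf32_st_type(st_info)
    if STT_LOPROC <= sym_type <= STT_HIPROC:
        return "processor-specific"
    return STT_NAMES.get(sym_type, f"unknown ({sym_type})")


print(describe_symbol_type(0x12))  # STT_FUNC (binding 1, type 2)
print(describe_symbol_type(0x0D))  # processor-specific
```

Values between 5 and 12 are not defined in the figure, so the sketch reports them as unknown instead of guessing.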
Function symbols (those with type STT_FUNC) in shared object files have special significance. When another object file references a function from a shared object, the link editor automatically creates a procedure linkage table entry for the referenced symbol. Shared object symbols with types other than STT_FUNC will not be referenced automatically through the procedure linkage table.
If a symbol’s value refers to a specific location within a section, its section index member, st_shndx, holds an index into the section header table. As the section moves during relocation, the symbol’s value changes as well, and references to the symbol continue to “point” to the same location in the program. Some special section index values give other semantics.
SHN_ABS
The symbol has an absolute value that will not change because of relocation.
SHN_COMMON
The symbol labels a common block that has not yet been allocated. The symbol’s value gives alignment constraints, similar to a section’s sh_addralign member. That is, the link editor will allocate the storage for the symbol at an address that is a multiple of st_value. The symbol’s size tells how many bytes are required.
SHN_UNDEF
This section table index means t http://linux-miheeff.ru

10 comments