The success of modern molecular biology might be considered a Cartesian
dream. Reductionism, René Descartes' idea that complex phenomena can be
understood by reducing them to their constituent parts, has turned out,
despite all its limitations, to be a home run in molecular biology.
The developments in modern biology have their roots in the interdisciplinary work of scientists from many fields. This was a crucial element in the breaking of the code of life: Max Delbrück, Francis Crick and Maurice Wilkins all had backgrounds in physics. In fact, it was the physicist Erwin Schrödinger (ever heard about Schrödinger's cat?) who, in "What Is Life?", was the first to suggest that the "gene" could be viewed as an information carrier whose physical structure corresponds to a succession of elements in a hereditary code script. This carrier later turned out to be DNA, one of the two types of molecules "on which life is built".
Linus Pauling, the chemist, vitamin C-ist and anti-atom-bombist, determined the structure of the other type of molecule, the protein: a chain made up of units called amino acids.
This work inspired James Watson and Francis Crick in 1953 to elucidate
the structure of DNA - the ABC of all known living matter. To cut a long
story short, over the following years many people pieced the puzzle together:
The building blocks of life are the 20 amino acids that make up proteins;
DNA contains the blueprints for these structures in its own structure.
It is a long strand made of 4 nucleotides - this is the code of life. It
goes ACGTTCCTCCCGGGCTCC, and so on, and so on, and so on. If you
know the code you know the structure of all living things, at least in principle.
A ball-and-stick model of guanine (G), one of the 4 standard nucleotide bases. The colored balls represent the atoms from which it is made. Similar ball-and-stick models can be constructed for the 20 amino acids.
Here is a summary of the relationship between DNA and protein:
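As a rough illustration of that relationship, the sketch below translates the example strand quoted earlier (ACGTTCCTCCCGGGCTCC) into amino acids, three nucleotides at a time, using the standard genetic code. The names `CODON_TABLE` and `translate` are made up for this example, and reading the strand as coding sequence starting at its very first nucleotide is a simplifying assumption; in a real cell the DNA is first copied into RNA and translation depends on start and stop signals.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, codons enumerated in TCAG order.
# "*" marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon; stop at the first stop codon.

    Assumption: the reading frame starts at position 0 of the string.
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ACGTTCCTCCCGGGCTCC"))  # TFLPGS
```

Each letter of the result stands for one amino acid in the usual one-letter notation, so the 18 nucleotides describe a chain of six amino acids.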
Restless technology has produced means of reading genes (DNA) almost
like a bar code. The problem is that life is a complicated business, and
therefore the code to describe even the smallest of God's creatures would
fill many books. But scientists are very ambitious people and do lots of
overtime. They have started to decode "themselves" in the Human
Genome Project, coordinated internationally by the Human Genome
Organisation (HUGO). In fact, a sort of "average" human is being decoded
by sampling DNA from anonymous donors, but the difference in DNA between
any one human and another (or a scientist...) is very small.
Nevertheless, an average human scientist is made up of about 2.9 billion
(2.9 × 10^9) nucleotides!
This orgy of reductionism presents problems which only big brother can solve: how do I store all this information in a form which is universally accessible and retrievable? What started as a Cartesian dream is turning out to Bill Gates' satisfaction: computers are needed!
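To get a feeling for the size of the storage problem, here is a back-of-the-envelope sketch. It assumes the simplest possible encoding, two bits per nucleotide (four nucleotides per byte), which is an illustration only and not a description of how any particular data bank stores its sequences. The function names are invented for this example.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(dna):
    """Pack a DNA string into bytes, using 2 bits per nucleotide."""
    packed = bytearray((len(dna) + 3) // 4)
    for i, base in enumerate(dna):
        packed[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(packed)


def unpack(packed, length):
    """Recover the first `length` nucleotides from packed bytes."""
    return "".join(
        BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )


def packed_size_bytes(n_nucleotides):
    """Number of bytes needed for n nucleotides at 2 bits each."""
    return (n_nucleotides + 3) // 4


fragment = "ACGTTCCTCCCGGGCTCC"
assert unpack(pack(fragment), len(fragment)) == fragment
print(packed_size_bytes(2_900_000_000))  # 725000000 bytes, i.e. about 725 MB
```

Under this assumption the bare sequence of one human genome fits in roughly 725 megabytes; annotations, names and cross-references add to that.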
Large computer data banks, accessible to you and me, store this vast quantity of information. There are many different data banks in which DNA and protein sequence information is stored. Three examples are listed in the table below.
|Name of data bank|Type of sequences stored|Number of sequences (1996)|
|---|---|---|
|EMBL / GenBank|Nucleotide sequences|827,174|
The growth of one typical data bank is shown below: the increasing number of sequences in the SWISSPROT data bank as time goes by.
An advantage of these data banks is their flexibility. All this information
can be ordered and combined according to different patterns and tell us
an awful lot.
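The motto goes: don't just store it, analyze it! By comparing sequences, one can find out about things like evolutionary relationships, protein structure, protein function and characteristic sequence patterns.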
Phylogenetic trees are genealogical trees built from information gained by comparing the amino acid sequences of one protein, such as cytochrome c, sampled from different species. As a result of Darwinian evolution, the protein has a slightly different amino acid sequence in each species. Proteins like beta-amylase or hemoglobin cannot be chosen to get the "full picture", that is the full tree, because they do not occur throughout living matter. One phylogenetic tree, for instance, was created with the sequences of cytochrome c from several plants, animals and fungi. Below, part of this phylogenetic tree is shown.
Drawing of a phylogenetic tree based on the amino acid sequence data of cytochrome c (see inset).
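How a tree can grow out of sequence comparison may be sketched as follows. The text does not say which method was used for the tree shown here; the sketch uses UPGMA (repeatedly joining the two closest groups by average distance), chosen only because it is short. The four sequences are invented toy data, not real cytochrome c, and are assumed to be already aligned to equal length.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(seqs):
    """Build a tree as nested tuples by average-linkage clustering."""
    # Map each cluster (a name or a nested tuple) to its member names.
    clusters = {name: [name] for name in seqs}
    dist = {
        frozenset((m, n)): p_distance(seqs[m], seqs[n])
        for m, n in combinations(seqs, 2)
    }

    def average(c1, c2):
        pairs = [(m, n) for m in clusters[c1] for n in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two clusters with the smallest average distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average(*p))
        clusters[(c1, c2)] = clusters.pop(c1) + clusters.pop(c2)
    return next(iter(clusters))


# Hypothetical aligned fragments from four hypothetical species.
toy = {
    "A": "GDVEKGKKIF",
    "B": "GDVEKGKKIY",
    "C": "GDAEKGKTLF",
    "D": "GSAEAGKTLF",
}
print(upgma(toy))  # (('A', 'B'), ('C', 'D'))
```

Species A and B differ at only one position and are therefore joined first; C and D form the second branch. Real phylogenetic programs use more careful distance measures and tree-building methods.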
Prediction of protein structure from sequence is one of the most
challenging tasks in today's computational biology. More or less, the task
is to calculate an image like the one in the second
figure of this text. Although most of the information about the 3-dimensional
structure is encoded in the amino acid sequence, it is still unknown which
information controls the process of protein folding. Among millions of
possible folding products, proteins take up one working, native structure.
Since it is very difficult and expensive to determine structures by methods
like X-ray diffraction or NMR spectroscopy, there is a great need for reliable
prediction of the 3-dimensional structures of proteins from sequence data.
Today there are methods which give a fairly reliable result from the
available sequence data; the odds of getting it "right" are about 65%.
Sequence comparison is a very powerful tool in molecular biology, genetics and protein chemistry. Frequently it is unknown which protein a new DNA sequence codes for, or whether it codes for any protein at all. If you compare a new coding sequence with all known sequences, there is a high probability of finding a similar sequence. Often it is already known which role the protein in the data bank plays in the cell. If you assume that a similar sequence implies a similar function, you now have much more knowledge about your new sequence than before. (See also the contribution by Joelle Thonnard in this volume.)
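The idea of comparing a new sequence with every known one can be sketched in a few lines. The scoring below is a plain Needleman-Wunsch global alignment with arbitrarily chosen values (match +1, mismatch -1, gap -2); it is not how the fast data bank search programs work internally, and the "data bank" is a hypothetical two-entry dictionary invented for this example.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score of two sequences.

    Only the previous row of the scoring matrix is kept in memory.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def best_hit(query, data_bank):
    """Return the name of the entry that aligns best with the query."""
    return max(data_bank, key=lambda name: alignment_score(query, data_bank[name]))


# Hypothetical entries; real data banks hold hundreds of thousands.
data_bank = {
    "entry_1": "ACGTTCCTCC",
    "entry_2": "GGGGGGGGGG",
}
print(alignment_score("ACGT", "AGT"))      # 1 (three matches, one gap)
print(best_hit("ACGTTCCTCG", data_bank))   # entry_1
```

If the role of `entry_1` in the cell were already known, it would be a first guess at the role of the new sequence.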
Proteins of one class often show a few amino acids that always occur at the same positions in the amino acid sequence. By looking for such "patterns" you can gain information about the activity of a protein of which only the gene (DNA) is known. Evaluation of such patterns yields information about the architecture of proteins. Often these patterns are involved in active sites, which are the workbenches of proteins.
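Searching for such a pattern is easy to sketch with a regular expression. The pattern below, [AG]-x(4)-G-K-[ST], follows the well-known P-loop (ATP/GTP-binding site) consensus as far as the editor is aware, but should be treated here as an illustrative assumption; the protein fragment is invented, and `find_motif` is a name made up for this example.

```python
import re

# "[AG]" = A or G, ".{4}" = any four amino acids, then G, K, and S or T.
P_LOOP_LIKE = re.compile(r"[AG].{4}GK[ST]")


def find_motif(protein, pattern=P_LOOP_LIKE):
    """Return (start, matched_text) for each non-overlapping match.

    Positions are 0-based offsets into the sequence string.
    """
    return [(m.start(), m.group()) for m in pattern.finditer(protein)]


# Hypothetical protein fragment in one-letter amino acid code.
print(find_motif("MSLGPSGSGKSTLA"))  # [(3, 'GPSGSGKS')]
```

A hit like this does not prove anything by itself, but it suggests where to look for an active site.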
A lot of complicated algorithms have been created. There are tools such as
FASTA and BLAST for scanning data banks for sequences. There are programs
like Clustal and MSA for comparing sequences. There are hundreds more.
Although the development of new tools is more transparent because of the
possibilities of the Internet, it is not easy to keep up with everything.
Exploitation of these possibilities requires a new breed of scientist:
those versed in information technology AND biology, and they may
enable us to go where no man has gone before. Through a new surge of interdisciplinarity
it may be possible to transcend the limits of reductionism; from the vast
quantities of bytes and pieces, the contours of complex structures and
relationships might emerge from the genetic alphabet soup as life itself
once emerged from the primordial soup.
To fulfill this promise there are now interdisciplinary Internet courses where people can learn about biocomputing...