Why Biologists Should Not Treat Software as a Black Box

      Joelle Thonnard

      Introduction

      Over the last 10 years a torrent of protein, RNA and, above all, DNA sequence data has poured out, driven largely by the Human Genome Project. Biologists therefore face the problem of storing and analyzing tremendous amounts of data. Fortunately, sequencing and computer technologies have advanced in parallel during that time, leading to the spread of computational biology.

      Currently, a great deal of genomic and protein sequence information is stored in various public data banks. Moreover, user-friendly and powerful software for sequence analysis is now readily available. Hence, biologists do not necessarily need to be experts in mathematics and computer science to use tools for computer-aided sequence analysis.

      Biologists do not have to deal with obscure formulas like:

      Neither do they have to play with cryptic computer programs like:

      However, they should not treat software as a black box. Rather, they need to understand the assumptions and the principles of the methods on which they rely. To interpret the results of a computer program accurately, researchers must keep in mind the assumptions made by its methods. And to use the available programs most efficiently, they must be able to adjust the variable parameters that allow them to improve their results, which requires an understanding of the principles of the methods. These points are best explained with an example: sequence database searching.

      An Example: Sequence Database Searching

      Importance of Knowing the Assumptions

      Once a researcher has obtained the DNA sequence of a new gene and determined the amino acid sequence of the protein that this gene encodes, he still knows nothing about the three-dimensional (3-D) structure and function of the protein. The 3-D structure of a protein can be determined by techniques such as X-ray crystallography, and its function can be studied by a variety of biological tests. But one way to obtain some predictive data is to compare its amino acid sequence with all sequences in databases in order to find similar ones. A similar sequence may imply homology (descent from a common ancestor) and therefore may imply common function. Thus, if a query sequence is highly similar to a protein whose 3-D structure and function are known, one can build a model for the new sequence.

      The Process of Evolution

      Indeed, homologous proteins arise from mutations in a common ancestral gene. Through the process of gene divergence, some gene mutations have been accepted by natural selection because they preserved the folding and function of the encoded protein. The following figure represents a schematic tree in which several genes descend from a common ancestral gene.

      Real genes are much longer: they have a length of more than 100, not just a length of 4. But, as I said, this is just a schematic tree.

      Homology versus Similarity

      Thus, homologous proteins generally have a similarity higher than 25 % over their entire length and share similar 3-D structure and function. However, some gene mutations yielding proteins with new functions have also been accepted by natural selection, so a common ancestor does not systematically imply common function. Moreover, some proteins have diverged so much that their sequence similarity has dropped below 25 % although their folding and function have been preserved; low similarity therefore does not exclude homology and common function. Conversely, some proteins are highly similar although they are not homologous and do not have similar 3-D structure and function. In these cases the high similarity is generally confined to short stretches of sequence: matches that are more than 50 % identical over a 20-40 amino acid region occur frequently by chance.
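
      To make a figure such as "25 %" or "50 % identical" concrete, percent identity over an ungapped alignment can be computed roughly as below. This is a minimal sketch in Python, not the method of any particular program: the function name percent_identity and the two toy peptides are assumptions made for illustration, and real tools compute identity over alignments that may contain gaps.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions carrying the same residue in two
    equal-length, ungapped sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


# Hypothetical 8-residue peptides, made up for this example.
# 6 of the 8 positions carry the same residue.
print(percent_identity("ACDEFGHI", "ACDQFGKI"))
```

      With a stretch this short, a high percentage is reached with very few matching residues, which is why short, highly identical regions carry little weight on their own.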

      The following figure visualizes what we just discussed. The red arrows represent the "ordinary" way of things, whereas colored arrows represent uncommon but existing situations.

      Similarity Searching Tools Do Not Necessarily Retrieve Homologous Sequences

      Several programs are intended to retrieve sequences homologous to a query sequence, on the assumption that highly similar sequences are homologous. These programs measure a similarity score between the query sequence and each sequence in a database. They can also estimate the statistical significance of each score. But they cannot determine when a certain level of similarity implies homology. This is one of the reasons why biologists should not treat software as a black box.

      The example discussed by Dr Bill Pearson in his lecture (Protein evolution - How far back can we see?) also shows this.

      Searching a Database with Bovine Trypsin

      In this example, the bovine trypsin protein has been compared to sequences in a database. Comparison results with four sequences are summarized in this figure.


      The numbers following the name of the sequence represent:

      • %: percent identity
      • E: expectation value (E is the number of database sequences expected to reach such a similarity score purely by chance. To be statistically significant, E must be < 0.05)
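
      How an expectation value depends on the size of the database can be sketched as follows. The sketch assumes the common convention that E is the chance probability of a score in a single comparison multiplied by the number of sequences searched, and that chance hits follow a Poisson distribution. The function names and the numbers are hypothetical, and individual programs may estimate E differently.

```python
import math


def expectation_value(p_single: float, n_sequences: int) -> float:
    """Expected number of chance hits: per-comparison probability
    multiplied by the number of database sequences (assumed convention)."""
    return p_single * n_sequences


def prob_at_least_one(e_value: float) -> float:
    """Probability of one or more chance hits, assuming that chance
    hits are Poisson-distributed with mean e_value."""
    return 1.0 - math.exp(-e_value)


# Hypothetical numbers: a score with a chance probability of 1e-6 per
# comparison, searched against a database of 50,000 sequences.
e = expectation_value(1e-6, 50_000)
print(e, prob_at_least_one(e))
```

      Under these assumptions, the same score becomes less significant as the database grows, and for small values E is close to the probability of seeing at least one such match by chance.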

      The 3-D structure of these compared sequences can be seen on Pearson's slide 1. Animations of the four proteins can be seen here (this takes some time to load and run).

      These results show that bovine trypsin is very similar to bovine chymotrypsin (seq. 1) and S. griseus trypsin (seq. 2). These three proteins share similar 3-D structure and function; they all belong to the serine protease family. Bovine trypsin is also similar to endochitinase (seq. 3). However, these two proteins are unrelated and share no structural or functional similarity. On the other hand, the similarity between bovine trypsin and S. griseus protease (seq. 4) is lower, although this protein shares similar 3-D structure and function with the serine protease family and is thus considered homologous to its members.

      Accurate Interpretation of the Results

      When searching a database with a query sequence, the program used will always find sequences with a certain level of similarity. However, this does not automatically imply that they are homologous to the query sequence. Thus, when performing a database search, a researcher must know the difference between homology and similarity. Otherwise, he risks misinterpreting the results: by assuming that similar sequences are always homologous, he could conclude that a protein has a function that it does not have.

      Importance of Knowing the Principles

      Sensitivity Versus Specificity

      There are different ways to estimate the similarity between two sequences, which allows us to adjust the sensitivity and specificity of the results when searching a sequence database with a query sequence. If the sensitivity is high, more distantly related sequences, such as the S. griseus protease in the above example, will be retrieved. However, unrelated sequences, such as the endochitinase, will also be returned. On the other hand, if the specificity is high, only closely related sequences will be returned, and distantly related ones will be missed. Thus, a researcher has to know how to manage this trade-off. This is one more reason why biologists should not treat software as a black box.
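
      The two quantities can be written down with their usual definitions, as in the short Python sketch below. The counts are hypothetical and only illustrate the trade-off described above; they do not come from the trypsin example.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of the truly related sequences that the search retrieved."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of the unrelated sequences that the search left out."""
    return true_neg / (true_neg + false_pos)


# Hypothetical search: 10 related sequences in the database, 8 of them
# retrieved; 1,000 unrelated sequences, 20 of them retrieved by mistake.
print(sensitivity(true_pos=8, false_neg=2))
print(specificity(true_neg=980, false_pos=20))
```

      Loosening the search criteria typically moves false negatives into the true-positive count but also moves true negatives into the false-positive count, so one measure rises while the other falls.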

      Window Approach

      In particular, when comparing two sequences, a dot matrix can be used where one sequence is written out horizontally and the other is written out vertically. A dot is placed at the intersection of a row and a column for each matched pair of letters. If the frequency of matched letters between two sequences is high, particularly in DNA sequences, which are composed of only four building blocks, the background noise is high.
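
      The dot matrix just described can be sketched in a few lines of Python. The function name dot_matrix and the two short DNA sequences are made up for illustration; a '*' stands for a dot and a '.' for an empty cell.

```python
from typing import List


def dot_matrix(horizontal: str, vertical: str) -> List[str]:
    """Return one text row per letter of `vertical`.
    '*' marks a matched pair of letters, '.' a mismatch."""
    return [
        "".join("*" if h == v else "." for h in horizontal)
        for v in vertical
    ]


# Two short, made-up DNA sequences that differ at one position.
for row in dot_matrix("GATTACA", "GATCACA"):
    print(row)
```

      Related sequences show up as a run of dots along a diagonal, while the scattered dots elsewhere are the background noise, which is heavy for a four-letter alphabet.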

      In order to reduce the noise, one can place a dot only when several consecutive letters match. The number of consecutive letters evaluated together is called the window size.

      This window approach allows us to find more significant matches between two sequences. But, on the basis of different assumptions, different window sizes can be used yielding different comparison results.
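
      The effect of the window size can be sketched as below. This Python example assumes the simplest rule, in which a dot is placed only if every letter in the window matches exactly; actual programs may tolerate some mismatches inside a window. The function name windowed_dots and the sequences are hypothetical.

```python
def windowed_dots(seq_x: str, seq_y: str, window: int) -> set:
    """Positions (i, j) at which `window` consecutive letters of
    seq_x, starting at i, exactly match those of seq_y, starting at j."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return {
        (i, j)
        for i in range(len(seq_x) - window + 1)
        for j in range(len(seq_y) - window + 1)
        if seq_x[i:i + window] == seq_y[j:j + window]
    }


# Two short, made-up DNA sequences; print the number of dots that
# survive for each window size.
x, y = "GATTACAGATTACA", "GATCACAGATTACA"
for w in (1, 2, 4):
    print(w, len(windowed_dots(x, y, w)))
```

      Every dot found with a larger window is also found with a smaller one, so increasing the window size can only remove dots; the choice of window size decides how much of the weaker signal is discarded along with the noise.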

      Efficient Use of the Programs

      When performing a database search, a researcher must know that he can improve his results. If he understands the principle behind the use of windows, he will be able to increase the sensitivity by decreasing the window size parameter, which improves the ability of the program to recognize distantly related sequences. Alternatively, he will be able to increase the specificity by increasing the window size parameter. Since there are generally 1000 times more unrelated than related sequences in a sequence database, improvements that reduce the score of unrelated sequences can have dramatic effects.
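
      The weight of that 1000-to-1 ratio can be seen with a little arithmetic, sketched here in Python. The database size and the retrieval rates are hypothetical values chosen only to illustrate the point.

```python
def expected_hit_list(n_related: int, n_unrelated: int,
                      sensitivity: float, false_positive_rate: float):
    """Expected numbers of true and false hits for the given rates.
    The false-positive rate is 1 minus the specificity."""
    true_hits = n_related * sensitivity
    false_hits = n_unrelated * false_positive_rate
    return true_hits, false_hits


# Hypothetical database with 10 related and 10,000 unrelated sequences,
# searched at the same sensitivity with two false-positive rates.
for fpr in (0.01, 0.001):
    true_hits, false_hits = expected_hit_list(10, 10_000, 0.9, fpr)
    print(f"false-positive rate {fpr}: "
          f"{true_hits:.0f} true hits, {false_hits:.0f} false hits")
```

      In this sketch a false-positive rate of only 1 % already buries the true hits under about ten times as many false ones, whereas a tenfold reduction brings the two to the same order of magnitude.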

      Conclusion

      These short examples show that a researcher who wants to use the programs available for sequence analysis needs a sound knowledge of biocomputing. Knowing the capabilities and the drawbacks of the programs will help him use them more accurately and efficiently.

