VSNS BioComputing Division

      BioMOO Electronic Conference transcript, HTML Edition

      The session was held Friday June 7 1996, 19 h GMT


      Fast Database Searches and Evolutionary Relationships Among Proteins


      Special guest and invited lecturer:

      Dr. William R. Pearson
      Dept. of Biochemistry
      University of Virginia
      USA


      A list of participants, and the Bionet announcement / summary can be found below. There are also links to other VSNS-BCD Guest Lectures at BioMOO.


      Francisco says, "Hi everybody. It is my pleasure to welcome you to this guest lecture (or session) with Dr. William Pearson from the University of Virginia. Our session will be centered on questions about Fast Database Searches, FastA and the uses of these tools in looking for evolutionary relationships among proteins. If someone has a question, please raise your hand - just be patient, there are a great many of us here, and try not to speak at the same time! Before we start I wish to thank Bill Pearson for his participation in this forum. Bill, can you tell us about your new FastA 3.0 release?"

      BillPearson says, "A new version of FASTA (3.0) was released about two weeks ago. It uses the same methods and statistics as FASTA 2.0, but it is designed in a modular fashion and runs not only on conventional unix workstations, but also on multiprocessors in parallel. It has a few other improvements over FASTA2.0 - the statistical calculations are based on a larger sample of the database and the statistical estimates for TFASTA, which were not very good in FASTA2.0, are now very accurate. In addition, there is now a FASTX [and now TFASTX], which compares a DNA sequence to a protein database, translating in three frames and allowing for frame-shifts, which is available both in FASTA2.0 and FASTA3.0"

      GeorgF says, "A natural question is: Which improvements are you planning next ?"

      BillPearson says, "There are several improvements that I am thinking about:

      • FASTA/SSEARCH should return more than one best local region in the alignment the way BLAST does.
      • There should be a TFASTX, which is like TFASTA but does a better job of putting frameshifts into the comparisons (it is the complement of FASTX). [TFASTX is now available in FASTA2.0u6 and FASTA3.0t6.]
      • I hope to have a more sensitive version that uses a BLAST-like lookup in the initial phase.
      • I have some ideas for better statistical estimates. "

      CATHIE says, "I have a question: What happened with the program that tries to find the minimum number of primers for some set of related sequences?"

      BillPearson says, "We have been disappointed in the performance of the PRIMER program, so it is not publicly available. We should do some polishing and provide it later this summer."

      Gufo says, "May I pose a question about FASTX: does it give back a DNA sequence aligned to a protein match ?"

      BillPearson says, "No - FASTX shows you a protein vs protein comparison, because that is the score that is actually being optimized. FASTX contains two extra characters, '/' and '\', which indicate frameshifts, as well as the usual '-' for gaps."

      Francisco says, "Bill, we have had some discussions about why it is necessary to use protein sequences instead of DNA (if we got an ORF) for database searches - some people argue that you lose information doing that. Can you tell us something about this?"

      BillPearson says, "Now that I have FASTX - it is a bit easier to demonstrate that protein-protein comparison is more effective than DNA-DNA comparison. If I do a protein-protein search with FASTA and then do exactly the same search with the cDNA sequence - so that exactly the same coding region is available, the statistical significance of the similarity scores is reduced by about a factor of 5. This happens because the statistical significance is determined in this case solely by the distribution of the unrelated sequence scores (the related scores are the same because exactly the same coding information is provided). But with the FASTX translated DNA sequence, there are more ways of producing high scores with unrelated sequences (the other two reading frames can produce scores). So, since the scores of the unrelated sequences are higher, the statistical significance is reduced."
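
      The point about the extra reading frames can be made concrete with a small sketch. The code below is only an illustration, not FASTX itself: it translates a DNA sequence in the three forward frames using the standard genetic code, and the example sequence and function names are made up for this note. A protein query presents one sequence to the library, while the translated DNA query presents three, two of which are effectively unrelated sequences that can still score by chance.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate dna from offset frame (0, 1 or 2); unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translation(dna):
    """Return the translations of the three forward reading frames."""
    return [translate(dna, frame) for frame in range(3)]


# Hypothetical toy cDNA; only frame 1 carries the intended coding sequence.
cdna = "ATGGCCAAGTGGCTGTAA"
for number, protein in enumerate(three_frame_translation(cdna), start=1):
    print(f"frame {number}: {protein}")
```

      Only the first frame reproduces the encoded protein; the other two are extra candidates for chance similarity, which is the effect described above.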

      CATHIE says, "How do you know when a certain level of similarity implies homology? I mean, shall we do additional searches with sequences with lower but statistically significant scores?"

      BillPearson [to CATHIE]: I am a big fan of doing more than one search. (I do lots of searches). Doing multiple searches can help if your best match has a low, but statistically significant expectation value, say 0.01. If you have matches with E() = 0.01 with three or four distantly related members of a family, then you have much more confidence that the match is real. However, if the three or four family members are closely related to one another, then you would expect to get the same match with each sequence, and so you have not learned much.

      CATHIE says, "I think that even the average of an unrelated sequence increase with length, the variable score is constant for local similarities, is this true?"

      BillPearson says, "Additional searches are most valuable if you have good scores with distant homologues. You should also do additional searches to make certain that the best hit is not just because of a particular gap penalty or scoring matrix. For example, you can have much more confidence in a match if, after you find a candidate distantly related sequence, you do the search again with a higher gap penalty and the significance of the distant match increases. Also, if the region of alignment between your query sequence and the library sequence is much shorter than the length of the sequences, then, if the relationship is real, the statistical significance should improve by using BLOSUM62 instead of BLOSUM50. FASTA2.0 and 3.0 now make it easy to do searches with BLOSUM62 as well as the default BLOSUM50."

      GeorgF [to BillPearson]: Are there more "general rules" ? Especially about the trade-off between 1) Multiple searches 2) More strict searches (e.g. ktup=1) ?

      BillPearson [to CATHIE]: I did not understand your question.

      CATHIE says, "Is the new Fasta3 as good as the Smith-Waterman algorithm, or is it on the way to surpassing its performance?"

      GeorgF says, "-re- Multiple searches <-> More strict searches: Are there "indicators" for stricter searches, too ?"

      BillPearson [to GeorgF]: The indicator for stricter searches is the number of high-scoring, probably unrelated sequences with expectation values around 0.1. If you have a lot of sequences that are probably not related but have E() < 1.0, you should try another search with higher gap penalties.

      BillPearson [to CATHIE]: Fasta3 is not any better than Fasta2, in general. Smith-Waterman (SSEARCH) will do better in some cases, but those cases are rare.

      BillPearson says, "I usually do a search first with FASTA ktup=2, and then FASTA, ktup=1. If I don't find anything with those two searches, I find something with Smith-Waterman (SSEARCH) less than 10% of the time."

      CATHIE says, "Smith-Waterman is better for full length sequences I think!"

      BillPearson [to CATHIE]: In my Protein Science paper, I show that S-W is better than FASTA for full length sequences. However, that paper - to focus on differences between the methods - looked at the most divergent protein families. Most of the time, one is not trying to match a mammalian and bacterial sequence. If you are only looking back in time 500 million years, FASTA will get the job done.

      Francisco says, "Does anyone have more questions? Are those students around?"

      HoonL says, "In your experience, do you need to inspect alignment printout very often?"

      BillPearson [to HoonL]: Of course it is important to look at alignments for marginally significant matches. Sometimes a match has a good score because of low complexity repeated amino acid patterns. Only the alignment will show that. Also, alignments that include the entire length of one of the proteins are more likely to be reliable than those that only include part of the sequence.

      BillPearson says, "As I said before, if only part of the sequence is involved, the score should become more significant if a 'shallower' (BLOSUM62 vs BLOSUM50) matrix is used. I don't inspect alignment printout if there are no scores with expectation values < 1.0"

      Francisco smiles.

      PeteO says, "Naive question from a beginner: How do I interpret the output histogram? How are z-opt and E() defined?"

      BillPearson [to PeteO]: z-opt is a z-score converted to something like a similarity score.

      PeteO nods.

      BillPearson says, "So a score with a statistician's z-value of 0 will have a FASTA z-score of 50. A z-value of 5 will have a FASTA z-score of 100. The E() is the expected number of sequences that will have a FASTA z-score in the interval shown, say between 60 and 61. If the z-opt and E() values agree (the =='s and '*'s) then the statistical estimates are good. This is especially important in the tail between z-opt 80 - 120."
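
      The two values quoted above (z = 0 gives 50, z = 5 gives 100) are consistent with a simple linear rescaling. The sketch below assumes that linear form, which is not stated explicitly in the session; the function names are invented for illustration.

```python
def fasta_z_score(z):
    """Rescale a standard z-value so that z = 0 maps to 50 and z = 5 maps to 100.

    Assumes a linear mapping (mean 50, 10 units per standard deviation),
    inferred from the two quoted points.
    """
    return 50.0 + 10.0 * z


def standard_z(fasta_z):
    """Invert the rescaling: FASTA-style z-score back to a standard z-value."""
    return (fasta_z - 50.0) / 10.0


for z in (0, 3, 5, 7):
    print(f"z = {z}  ->  FASTA z-score {fasta_z_score(z):.0f}")
```

      Under this assumption, the histogram tail between z-opt 80 and 120 corresponds to roughly 3 to 7 standard deviations above the mean of the unrelated scores.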

      Francisco says, "This is important these days when no one looks at histograms carefully (Blast by default doesn't send histograms!)"

      BillPearson says, "But BLAST p-values are calculated analytically and are very conservative. Because FASTA2.0 (and 3.0) estimates are derived empirically from the search, it is easier for the estimation process to fail. This happens most often if the gap penalties are too low. BLASTP has gap penalties of infinity, so it is less of a problem."

      PeteO says, "Something like p-value"

      BillPearson [to PeteO]: The z-opt() is really not much like a p-value, although the p-value is derived from z-opt(). E() values are the same as BLAST p() values when the E() value is < 0.1.
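
      One way to see why E() and p() values agree when they are small: if the number of chance matches is assumed to follow a Poisson distribution with mean E (a standard assumption, not something spelled out in the session), the probability of at least one chance match is P = 1 - exp(-E). The sketch below just evaluates that formula; the function name is made up.

```python
import math


def p_from_expectation(e_value):
    """Probability of at least one chance match, assuming Poisson-distributed hits."""
    # -expm1(-E) equals 1 - exp(-E) but is more accurate for small E.
    return -math.expm1(-e_value)


for e_value in (0.001, 0.01, 0.1, 1.0, 10.0):
    p_value = p_from_expectation(e_value)
    print(f"E() = {e_value:<6}  P = {p_value:.5f}")
```

      For E() below about 0.1 the two numbers differ by a few percent at most; for larger E() the probability saturates toward 1 while the expectation keeps growing.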

      PeteO nods.

      CATHIE says, "We can not always be sure about statistics, what about false negatives?"

      BillPearson [to CATHIE]: There is not much one can do about false negatives, except wait for the database to be more complete. Some families have only a few members in the database that may be very distant from your query sequence. You will always miss those in any pair-wise comparison search. Some multiple-sequence methods may do better, but if there are enough sequences in the database to produce a good multiple alignment or profile, there are probably enough sequences that the problem of false negatives is reduced.

      Francisco eyes around wondering if silent people also have questions.

      Marvin [to BillPearson]: How can we tell if sequences with high but not significant scores are related to the query sequence?

      BillPearson says, "It is very difficult. The only time that I follow up non-significant matches is if there are several matches to distantly related members of a family that I recognize. Mostly, if there are no significant matches, I simply accept the result."

      Marvin nods.

      Lappe says, "It seemed to me that the issue of searching again with a distant homologue is a quite critical one. I wonder if this could be automatically done by an algorithm in a sensible way.. ?"

      BillPearson says, "That is a good suggestion. In fact, several groups (Koonin, Lander) have suggested and implemented such a strategy. Now that FASTA and SSEARCH have good statistical estimates, I think it can be done more reliably, but I have not done it myself. In addition, it would be helpful if the FASTA results showed annotation information that told the user that the high scoring sequences were all distantly related members of the same family. The BLOCKS database might be used for this."

      Francisco says, "Bill, if we have good matches (i.e. E()<0.002) can we conclude they have a similar fold? Or, to put it another way, does amino acid similarity always mean similar folding?"

      BillPearson says, "If they have E()<0.002 because the sequences are homologous, and not just because they are PRLPRLPRL or some other simple sequence, then yes, the proteins have not just the same fold but very similar three-dimensional structures."
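
      A simple sequence such as PRLPRLPRL can be flagged crudely by its amino acid composition. The sketch below computes the Shannon entropy of the residue composition; this is only an illustrative check, not the method FASTA or any particular filter uses, and the second example sequence is invented.

```python
import math
from collections import Counter


def shannon_entropy(sequence):
    """Shannon entropy (bits per residue) of the residue composition."""
    if not sequence:
        return 0.0
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# A repeat built from three residues versus a more varied (hypothetical) peptide.
for peptide in ("PRLPRLPRL", "MKTAYIAKQR"):
    print(f"{peptide}: {shannon_entropy(peptide):.2f} bits")
```

      The repeat uses only three residues in equal proportion, so its entropy is log2(3), about 1.58 bits, well below the maximum of about 4.32 bits for 20 equally frequent amino acids. A low value is a hint to inspect the alignment, not proof that a match is spurious.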

      Gautam says, "Dr. Pearson, a little while ago you mentioned the phrase: *low statistically significant*. In my understanding a low expectation value is statistically significant. So, why is it necessary to use *low* in that phrase? It implies that there are some non-analytical methods to measure significance (which is a statistical term). Is this correct?"

      BillPearson says, "I am sorry, I meant a low similarity score, not a low expectation value."

      Gautam says, "Thanks.. that satisfies my curiosity."

      BillPearson says, "If a match has a similarity score of 500 and E()< 10^-10, I always have confidence in it. If the score is 100 and E()<0.05, I am more nervous."

      Francisco says, "just one more question, please."
      Francisco looks around.

      CATHIE says, "To summarize, what are the principal differences among S-W, Fasta, Blast and the Prosite search?"

      BillPearson [to CATHIE]: S-W, FASTA, and BLASTP are all slightly different methods for doing similarity searches. PROSITE is something completely different, because it looks for an exact match to a pattern rather than a similarity score.

      BillPearson says, "S-W, Fasta, and to a slightly lesser extent BLASTP base the similarity score on the entire conserved region between two sequences. Typically this is the entire length of the protein or the protein domain. PROSITE looks at a very small number of very highly conserved residues. You cannot easily calculate a statistical significance for a PROSITE search. I believe that similarity searches are far more robust and trustworthy than pattern/PROSITE searches."
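
      The all-or-nothing nature of a pattern search can be sketched with a regular expression. The code below converts a PROSITE-style pattern string into a Python regex; the pattern and the two sequences are illustrative only, and the converter handles just the common syntax elements (x, ranges, allowed and excluded residue sets, anchors).

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern string into a Python regular expression."""
    regex = pattern.rstrip(".")
    regex = regex.replace("-", "")                        # element separators
    regex = regex.replace("x", ".")                       # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")    # {AB} = anything but A or B
    regex = regex.replace("(", "{").replace(")", "}")     # (n) or (n,m) = repeat count
    regex = regex.replace("<", "^").replace(">", "$")     # N- and C-terminal anchors
    return regex


# Illustrative PROSITE-style pattern (an assumption for this example).
pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = re.compile(prosite_to_regex(pattern))

# Two hypothetical sequences that differ at a single position.
original = "CAACAAALAAAAAAAAHAAAH"
mutated = "CAACAAALAAAAAAAAHAAAN"

identical = sum(a == b for a, b in zip(original, mutated))
print("regex:", regex.pattern)
print("original matches:", bool(regex.search(original)))
print("mutated matches: ", bool(regex.search(mutated)))
print(f"identical positions: {identical}/{len(original)}")
```

      A single substitution at a pattern position removes the match entirely, whereas a similarity score over the same pair would drop only slightly. That difference is one reason a pattern hit is hard to assign a statistical significance.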

      Gautam says, "In regards to E() value, why are operators < and > used. For example, you might say that a sequence related to serine protease has a similarity score <66 to make a point about false negatives. In order to make the same point more emphatically, would it not be better to use an E()>66. Or am I missing something?"

      BillPearson [to Gautam]: It should always be E() < ???.

      Francisco says, "I wish to thank Bill again for his participation in the BCD BioComputing Course - it has been great help to have you here!"

      Marvin claps.
      Gautam claps.
      CATHIE claps too!
      Ulf claps.

      BillPearson says, "Good bye all - I am happy to answer questions (or do searches) at wrp@virginia.edu"

      Lappe cheers.
      CATHIE keeps on clapping!
      PeteO thanks Bill Pearson :)

      Francisco says, "So, this session is finished now - thanks also to the participants!"

      GeorgF says, "In the name of the Virtual School of Natural Sciences, I'd like to thank Bill Pearson as well, and also the BioMOO technical help (Gustavo, EricM, Yacco), and all the other helpers too numerous to mention ! "

      ChristianF claps.
      BEN-AC says, "bye"
      Gufo says, "bye".
      CATHIE waves.
      Lappe waves goodnight.
      Gautam waves bye to all.
      BillPearson has disconnected.
      Marvin waves.
      CATHIE says, "NICE TO MEET YOU"
      Francisco waves.
      CATHIE says, "It was a pleasure to be here for the first time!"


      Participants (most descriptions taken from the public BioMOO user database):

      • [AnaCC] Ana Cecilia Cepeda Nieto, student, CINVESTAV - IPN, Mexico
      • [BEN-AC] Benito Aguilar Carrillo, student, CINVESTAV IPN, Mexico
      • [Bergeron] Anne Bergeron, Professor, Universite du Quebec a Montreal, Canada
      • [CATHIE] Venturelli-R-Caterina, Graduate Student (Molecular B. & Genetics) at the CINVESTAV IPN, Mexico
      • [ChristianF] Christian Frosch, PhD student (chemistry/biochemistry), University of Mainz, Germany
      • [EdgarA] Edgar Arriaga, Research Associate, University of Alberta, Chemistry Dept., Canada
      • [EricM] Eric H. Mercer, post-doc, California Institute of Technology, USA
      • [Ever] Everardo Gonzalez Rodriguez, student, Centro de Investigacion de Estudios Avanzados de el IPN (CINVESTAV), Mexico
      • [Feng] Feng Qian, student (pharmacology), University of Chicago, USA
      • [Francisco] Francisco M. De La Vega, Assistant Professor, Dept. of Genetics and Molecular Biology, CINVESTAV-IPN, Mexico
      • [Gautam] Gautam B. Singh, Senior Bioinformatics Engineer, National Center for Genome Resources, USA
      • [GeorgF] Georg Fuellen, PhD student, University of Bielefeld, Germany
      • [Gufo] Alessandro Guffanti, student (biology), TIGEM, Milano, Italy
      • [HoonL] Chin Hoon LAU, Ph.D. student, Department of Biochemistry, National University of Singapore
      • [Lappe] Michael Lappe, student, Department of Mathematics and Computer Science, University of Paderborn, Germany
      • [Maricela] Maricela Cruz Soto, student, CINVESTAV, Mexico
      • [Marvin] David T. Croke, Senior Lecturer, Royal College of Surgeons, Ireland.
      • [PeteO] E. Peter Olds, USA.
      • [SDelinger] Scott L. Delinger, Research Associate, University of Alberta, Chemistry Dept., Canada
      • [Ulf] Ulf Reimer, graduate student (biochemistry), MPG AG "Enzymologie der Peptidbindung" Halle/Saale, Germany
      • [VBuskirk] Chris van Buskirk, Computer Scientist, NCI-Biomedical Supercomputing Center, USA



      Bionet Announcement

      Available: Electronic Meeting transcript on "Fast Database Searches and Evolutionary Relationships Among Proteins", by Dr. W. Pearson.

      The latest release of Fasta (v. 3.0) and its appropriate use for fast database searches was the main topic discussed in the Guest Lecture delivered by Dr. William Pearson of the Department of Biochemistry at the University of Virginia, USA, at the electronic conferencing system BioMOO on June 7th, 1996. Most of the 21 attendees, who came from 7 countries, were students of the Summer 1996 BioComputing Course, a novel distance learning experience organized for the second time by the BioComputing Division of the Virtual School of Natural Sciences, a member school of the Globewide Network Academy. The Course covered the fundamentals of sequence analysis and comparison, was delivered entirely through the Internet for free and lasted two and a half months, attracting 37 participants from all over the world (for an account of the course, and access to its hypertext book and related materials, visit the URL: http://www.techfak.uni-bielefeld.de/bcd/welcome.html ).

      The edited transcript includes a link to the main FastA v 3.0 FTP distribution site and to previous lectures, and is available at the following locations (WWW/hypertext): http://www.techfak.uni-bielefeld.de/bcd/Lectures/pearson3.html http://merlin.mbcr.bcm.tmc.edu:8001/bcd/Lectures/pearson3.html http://www.biotech.ist.unige.it/bcd/Lectures/pearson3.html

      The discussion included an account of the new features incorporated in version 3.0 of FastA and how it compares with the popular NCBI BLAST algorithm. The influence of the length of the sequences on the database similarity scores and how this is corrected in FastA 3.0 was an important point of the talk. A strategy for identifying distant protein homology during a fast database search, and general tips on the interpretation of the output of these methods, were discussed with the participants.

      The Electronic Conference was organised by the VSNS Biocomputing Division, sponsored by the Research Group in Practical Computer Science, University of Bielefeld, and the Association for the Promotion of Science and Humanities in Germany (Stifterverband für die Deutsche Wissenschaft). We would like to thank the Aspen Center for Physics for providing its facilities for the 1996 "Identifying features in Biological Sequences" Workshop and the Internet link from which Dr. Pearson and Dr. De La Vega connected, and Digital Equipment Corp. for the workstation cluster provided for the workshop. Alexander Sczyrba, David Atherton, Gustavo Glusman, Eric Mercer, and other "BioMOO folks" (see http://bioinfo.weizmann.ac.il/BioMOO) are thanked for their kind assistance.

      Regards,
      Francisco M. De La Vega
      Georg Fuellen

      GNA-VSNS BioComputing Division


      Converted to HTML by Alexander Sczyrba and Georg Fuellen.