Francisco's group 4th BioMOO Meeting

VSNS BioComputing Division - BioInformatics Course

Francisco De La Vega's Study Group

4th BioMOO Session Transcript

June 1st, 1995, 24:00 hrs GMT


Discussion Theme: Rapid Database Searches


Special guest and invited lecturer:

Dr. William Pearson (BillPearson), Dept. of Biochemistry, University of Virginia.

Connecting from the Aspen Center for Physics's workshop "Patterns in Biological Sequences".


Participants:


There are also links to other VSNS-BCD Guest Lectures at BioMOO.

The recording: (See also the Announcement of this transcript.)

EricM turns GeorgF's recorder on.

GeorgF thanks EricM for getting the recorder going !

BillPearson [guest] says, " Hi folks"

Francisco says, "Ok, we can start the session now!"

GeorgF says, "This is Francisco's 4th session, and I'll let him talk now !"

Francisco says, "We are going to start with some questions to Bill, but let's make some order"

Francisco says, "If someone has a question, please whisper to me so we can keep some order"

Francisco says, "First I wish to thank Bill Pearson for being here; it is not so easy to communicate in BioMOO!"

BillPearson [guest] says, " I thank you for the invitation. I have seen BioMOO before, but was never ready to use it"

Francisco says, "I would like you to ask questions regarding FastA, Blast, similarity matrices and the like"

Francisco says, "Who is starting?"
Francisco eyes Mandy

Mandy says, "Ok, what does 'ktup' mean in the FastA program?"

CarlosH says, "Do you recommend using the mailfasta program?" to BillPearson

BillPearson [guest] says, " I am not familiar with the mail fasta program. I believe that it uses an older version of fasta, which is not as good as the current version (fasta20x). I would like to encourage all of you to download the latest version of fasta and give it a try. Unlike earlier versions, it calculates a probability for similarity scores much like that provided by BLASTP. I think this is a big improvement, since it allows you to see at a glance whether a high ranking score is likely to be interesting."

CarlosH says, "Bill what about FastA and Blast freq histograms?"

BillPearson [guest] says, "In addition, it uses the Smith-Waterman algorithm to do the final alignments, so there is no limit on gaps."

BillPearson [guest] says, "To Mandy - ktup (1 or 2 for proteins) is the size of the smallest piece of the sequence that fasta looks at when it does a search. So with ktup=2 (the default), fasta only looks at parts of the alignment where there are two identical amino acids. Since for some very distantly related proteins, there may not be any pairs of identical amino acids, ktup=1 is more sensitive. With the new probability estimates, I think the jury is out on whether increases in sensitivity are always accompanied by decreases in selectivity."
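
[Editor's sketch, not part of the session.] The ktup idea Bill describes can be illustrated with a small word-lookup table. This is a minimal Python sketch, not FASTA's actual code: the function names `ktup_index` and `diagonal_hits` and the toy sequences are assumptions made for illustration. It counts identical words of length ktup shared by two sequences, grouped by alignment diagonal, so ktup=2 only sees adjacent identical pairs while ktup=1 sees every single identity.

```python
from collections import defaultdict


def ktup_index(seq, ktup):
    """Map every word of length ktup in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        index[seq[i:i + ktup]].append(i)
    return index


def diagonal_hits(query, library_seq, ktup=2):
    """Count identical ktup-words shared by two sequences, keyed by diagonal.

    The diagonal is (query position - library position); many hits on one
    diagonal suggest a region worth aligning more carefully.
    """
    index = ktup_index(query, ktup)
    hits = defaultdict(int)
    for j in range(len(library_seq) - ktup + 1):
        word = library_seq[j:j + ktup]
        for i in index.get(word, ()):
            hits[i - j] += 1
    return dict(hits)


# Toy pair with no two adjacent identical residues in common:
print(diagonal_hits("ACDE", "AGDF", ktup=2))  # {}
print(diagonal_hits("ACDE", "AGDF", ktup=1))  # {0: 2}
```

With ktup=2 the toy pair shares no words at all, while ktup=1 still finds two identities on the main diagonal, which is the sensitivity difference described in the answer to Mandy.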

Francisco says, "Bill, which is the main problem in using FastA or Blast: finding too many false positives, or missing some good homologs?"

BillPearson [guest] says, "If you look at publications, I think there are many more reports of false positives. Of course, that may just be because when someone misses something, they don't report it. Because it produces good statistical estimates, and because of the way BLASTP works, I think there are very few false positives with BLASTP. Now that FASTA provides similar statistics, hopefully the number of false positives with FASTA will be reduced as well. It is important to remember, though, that in most diverse families there are many homologous proteins that do not share significant similarity with some other members of the family. So lack of significant similarity does not guarantee non-homology. But it does guarantee that you cannot base a claim of homology on similarity."

GeorgF [to BillPearson]: What are the next improvements you're thinking about -re- FASTA. Including features along the lines of Blastp, has that only become possible because computers are getting more powerful ?

BillPearson [guest] says, "FASTA has gotten slower in its last version because most alignments are optimized by default. The next step forward will be to try to speed it back up, without losing sensitivity. In addition, I hope to have a parallel version of fasta that can be used routinely available in the next few months. Finally, there are other modifications that can be made to improve selectivity that I will be testing."

GeorgF says, "(Alx is Alex Jeffries, one of my students from Australia. We've now got 3 continents represented :-)"

Alx says, "hello all."

GeorgF says, "Parallel version: how hardware-dependent will it be ?"

Francisco says, "Bill, something that is important is which similarity matrix to use with these methods; is the default one good?"

BillPearson [guest] says, "The parallel version will (does currently) run on networks of workstations under PVM. I have run it on Suns, DEC Alphas, RS/6000s, and SGIs."

BillPearson [guest] says, "The scoring matrix used in the current FASTA (BLOSUM50) works pretty well. If anything, I tend to change the gap penalties rather than the scoring matrix. It is now much easier to change gap penalties (they can be specified on the command line). When I find that there seem to be too many unrelated sequences with low probabilities, I raise the gap penalties from -12/-2 to -14/-2 or -16/-2."

Mandy says, "Which gap penalties are better for proteins and which for nucleic acids?"

BillPearson [guest] says, "The new fasta uses -12/-2 for proteins and -16/-4 for DNA. I found that when I used lower gap penalties with DNA, many unrelated sequences had very low (misleading) probabilities. But of course, the number one thing that you should learn this evening is that in general, you should try not to do DNA sequence comparison, unless you are really interested in a sequence that does not code for a protein. Not doing DNA sequence comparison - comparing proteins instead - is the very most effective way of improving your searches."
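
[Editor's sketch, not part of the session.] To make the effect of gap penalties on a Smith-Waterman score concrete, here is a small Python sketch of local alignment scoring with affine gaps. It is not the FASTA or SSEARCH code. Assumptions: a penalty pair such as -12/-2 is read as -12 for the first residue in a gap and -2 for each additional residue; a simple match/mismatch score (+5/-4) stands in for BLOSUM50; the function name and toy sequences are hypothetical.

```python
def smith_waterman(a, b, match=5, mismatch=-4, gap_first=-12, gap_next=-2):
    """Best local alignment score with affine gaps.

    Assumed convention: a gap of k residues costs
    gap_first + (k - 1) * gap_next.
    """
    neg = float("-inf")
    cols = len(b) + 1
    h_prev = [0] * cols    # best scores for the previous row
    f_prev = [neg] * cols  # previous row, alignments ending in a vertical gap
    best = 0
    for i in range(1, len(a) + 1):
        h = [0] * cols
        f = [neg] * cols
        e = neg            # current row, alignments ending in a horizontal gap
        for j in range(1, cols):
            e = max(h[j - 1] + gap_first, e + gap_next)
            f[j] = max(h_prev[j] + gap_first, f_prev[j] + gap_next)
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[j] = max(0, h_prev[j - 1] + s, e, f[j])
            best = max(best, h[j])
        h_prev, f_prev = h, f
    return best


a = "A" * 10 + "C" * 10
b = "A" * 10 + "GG" + "C" * 10
print(smith_waterman(a, b))                 # 86: 20 matches, one 2-residue gap
print(smith_waterman(a, b, gap_first=-16))  # 82: the gap no longer pays off
```

In this toy case, raising the first-residue penalty from -12 to -16 removes the advantage of opening the gap, which is the direction of change Bill describes using when too many unrelated sequences score well.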

Heinz says, "is fasta20x available on the web, like blast & fasta?"

BillPearson [guest] says, " As far as I know, fasta20x is not available yet from a server. Hopefully this will change by the end of the summer. If worse comes to worse, I may have to put one up, but I would rather have the existing servers use the new code."

GeorgF [to heinz]: I've been searching for it, to no avail. But I'm sure it won't take long.

EricM says, "I wonder if there'll ever be an effective way to search large DNA databases for loosely defined sites of protein binding. Protein coding regions work well now, but transcriptional regulatory sites...awful."

BillPearson [guest] says, " Remember that searching for DNA binding sites is very very different from searching for related proteins. Homologous proteins - by definition - are produced by divergent evolution from a common ancestor. DNA binding sites probably arise by some convergence phenomenon. As a result, they tend to have much less information (and may depend much more on other proteins or other context) to be fully functional"

Heinz says, "what if someone is trying to design a primer that will be good for, say, actin genes of different species"

BillPearson [guest] says, " We are working on a program that does just this - it tries to find the minimum number of primers for some set of related sequences."

Heinz says, "this is just the sort of thing that has tremendous PRACTICAL value to bench scientists"

EricM considers the fact that most of the most important new genes in developmental biology the last two years have been found by PCR screening for genes encoding members of a family of related proteins.

BillPearson [guest] says, "Concerning protein homology searches vs DNA binding site searches - I think it is not always appreciated how informative an inference of homology is. If you do a search and find a significant match to a sequence in the NRL_3D database - the sequences for the proteins in the PDB protein structure database - you can very accurately thread the second sequence through the first. In contrast, inferences about convergence, such as in DNA binding sites, must always be supported by experiments. I find it a bit disconcerting when people refer to various binding sites when they have not done any experiments."

EricM . o O ( makes the paper look better when one does that )

Mandy says, "Which program performs better, Blast or the new FastA?"

BillPearson [guest] says, "The new fasta is significantly better than blastp - I have a paper that has been accepted in Protein Science that compares BLASTP, FASTA, and Smith-Waterman with a variety of scoring matrices and gap penalties. For full length sequences, a version of Smith-Waterman that corrects for high scores due to sequence length performs better than FASTA or BLASTP. For partial sequences, FASTA performs as well as Smith-Waterman and better than BLASTP. The studies in this paper were also used to modify the default scoring matrix and gap penalties for FASTA and Smith-Waterman."

Mandy says, " I see, I'm glad we are recording this, this is all good info!"
BillPearson [guest] says, " I hope your recorder has a built in spelling corrector"
Mandy smiles

Francisco says, "Bill, I would like if you talk a little bit more about length correction; I will paste an ascii version of one of your figures in this regard:"

EricM expects that having a full toolbox, with more than one algorithm that tries to do everything, is the best practical approach. But how to know what to use when, that's hard!

BillPearson [guest] says, "I don't think that it is so hard. The strategy that I always suggest is use the fastest first, and if you don't find anything, use the next fastest, etc. There are a lot of very important similarities that are found with BLASTP - BLASTP doesn't miss much. But, if BLASTP doesn't find anything, try FASTA, ktup=2. If that fails, try ktup=1. If that fails, try SSEARCH (Smith-Waterman). Most of the time, you will find whatever there is to find (or not find) with BLASTP or FASTA, ktup=2. The slower programs rarely reveal significant relationships that the faster programs missed. However, as more divergent genomes become available (yeast, E. coli), it may be more important to use the more sensitive methods (FASTA ktup=1, SSEARCH), particularly when examining yeast/mammal or bacterial/mammalian relationships."
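
[Editor's sketch, not part of the session.] The "fastest first" strategy can be written down as a short loop. This Python sketch is only an illustration: `cascade_search`, the stand-in search functions, their toy results, and the 0.01 cutoff are all assumptions, and no real BLASTP, FASTA, or SSEARCH program is called.

```python
def cascade_search(query, methods, e_cutoff=0.01):
    """Try methods from fastest to slowest; stop at the first significant hit.

    methods is a list of (name, search_function) pairs, where each function
    returns (hit_id, expectation_value) pairs for the query.
    """
    for name, search in methods:
        significant = sorted(
            (hit for hit in search(query) if hit[1] <= e_cutoff),
            key=lambda hit: hit[1],
        )
        if significant:
            return name, significant
    return None, []


# Stand-in search functions with made-up results (toy data only).
def fake_blastp(query):
    return [("seqA", 2.0)]


def fake_fasta_ktup2(query):
    return [("seqA", 0.5), ("seqB", 3.0)]


def fake_fasta_ktup1(query):
    return [("seqB", 0.004), ("seqA", 0.0002)]


def fake_ssearch(query):
    raise RuntimeError("not reached in this example")


methods = [
    ("BLASTP", fake_blastp),
    ("FASTA ktup=2", fake_fasta_ktup2),
    ("FASTA ktup=1", fake_fasta_ktup1),
    ("SSEARCH", fake_ssearch),
]
print(cascade_search("toy_query", methods))
# ('FASTA ktup=1', [('seqA', 0.0002), ('seqB', 0.004)])
```

The slowest method is never run here because an earlier, faster one already returned something significant, which is the point of the ordering.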

EricM says, "As someone whose field (neurodevelopment) depends on genes found in Drosophila then pulled from mice, and vice versa, I appreciate that."

[ASCII figure pasted by Francisco at BioMOO; its layout was lost in this transcript. Y-axis: SCORE. X-axis: SEQUENCE LENGTH. Points are drawn with X and _ characters.]

Francisco says, "Sorry, this is an ASCII version of one of Bill's figures; the X-axis is sequence length"

Heinz says, "do you think there will soon come a time where the servers that currently are freely available will begin to restrict external users so that sites must mount their own fasta, blast, etc. w/local copies of large databases?"

BillPearson [guest] says, "Since I don't provide a server, I don't know much about the economics/problems of providing one. But I think that for the foreseeable future, computers will continue to get faster faster than databases (particularly protein databases) will grow. So if the servers shut down, you will still be able to do the search on your Mac."

Francisco says, "Can you tell us what happens with scores and sequence lengths (in the figure the dashes are non-homologous hits, and the X's, real ones)?"

BillPearson [guest] says, "Statistical theory worked out by Waterman et al, Altschul/Karlin, and Mott has shown clearly that the average unrelated sequence score increases with the ln() of the length of the unrelated sequence, but that the variance of the scores is constant for local similarities. The new FASTA and SSEARCH do a linear regression of the library scores as a function of ln(length) and then convert all the scores to a z-value that removes the length effect. There is very good agreement between the distribution of actual unrelated scores and the predicted distribution from the extreme value distribution (you see this every time you do a search with fasta20x). This is where the probability estimates come from. Note that while z-values are used to normalize the scores, the distribution of similarity scores is not normal (gaussian), so a z-value of 3.0 or 5.0 is not usually significant. In fact, for a typical search with a 200 residue query, a match with a z-value of 6.0 is expected 0.25 of the time when SwissProt (44000 entries) is searched."
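
[Editor's sketch, not part of the session.] The length correction Bill describes (regress library scores on ln(length), then express each score as a z-value) can be sketched in a few lines of Python. This is an ordinary least-squares illustration only, not the fasta20x code: the function name, the toy scores, and the use of the residual standard deviation as the scale are assumptions, and the extreme value step that produces the probabilities is left out.

```python
import math


def length_corrected_z(scores, lengths):
    """Regress scores on ln(length) and return (z_values, slope, intercept).

    Each z-value is the residual from the fitted line divided by the
    residual standard deviation, so the length trend is removed.
    """
    xs = [math.log(n) for n in lengths]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, scores)]
    sd = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return [r / sd for r in residuals], slope, intercept


# Toy library: scores rise with ln(length) plus a little scatter.
lengths = [100, 200, 400, 800, 1600]
scatter = [1.0, -2.0, 2.0, -2.0, 1.0]
scores = [10.0 * math.log(n) + 5.0 + s for n, s in zip(lengths, scatter)]

z_values, slope, intercept = length_corrected_z(scores, lengths)
print(round(slope, 3), round(intercept, 3))
print([round(z, 2) for z in z_values])
```

After the correction, the longest toy sequence no longer stands out just because it is long; only the scatter around the fitted line remains.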

Mandy says, " Is it possible that related proteins to the query may appear with statistically lower scores?"

BillPearson [guest] says, " absolutely. Here is an example:"

---------------------------BillPearson at BioMOO---------------------------
GT2_DROME  GLUTATHIONE S-TRANSFERASE 2 (EC 2 ( 247)  164  124  124  210.1  2.9e-05
SC2_OCTVU  S-CRYSTALLIN 2.                   ( 215)  153  234  106  196.8  0.00016
GTAC_CHICK GLUTATHIONE S-TRANSFERASE, CL-3   ( 228)  144  123   87  185.0  0.00074
SC18_OMMSL S-CRYSTALLIN SL18.                ( 308)  131  186   85  162.4  0.013
GTTR_RAT   GLUTATHIONE S-TRANSFERASE YRS-YRS ( 243)  123  152   97  157.5  0.025
GT1_MUSDO  GLUTATHIONE S-TRANSFERASE 1 (EC 2 ( 208)  122  120   69  157.4  0.025
SC1_OCTVU  S-CRYSTALLIN 1.                   ( 214)  121  165  111  155.8  0.031
SC2_OCTDO  S-CRYSTALLIN 2 (OL2).             ( 215)  118  234  102  151.9  0.051
ARP_TOBAC  AUXIN-REGULATED PROTEIN (STR246C  ( 220)  117  100   64  150.5  0.061
SC4_OCTDO  S-CRYSTALLIN 4 (OL4).             ( 215)  116  214   95  149.3  0.072
SC3_OCTDO  S-CRYSTALLIN 3 (OL3).             ( 215)  115  220  100  148.0  0.084
GT32_MAIZE GLUTATHIONE S-TRANSFERASE III (E  ( 221)  108  117   84  138.9  0.27
GT1_MAIZE  GLUTATHIONE S-TRANSFERASE I (EC 2 ( 213)  102  104   70  131.5  0.7
SC20_OMMSL S-CRYSTALLIN SL20-1 (MAJOR LENS   ( 222)  102  223   96  130.9  0.76
GT1_DROME  GLUTATHIONE S-TRANSFERASE 1-1 (EC ( 209)  100  105   63  129.0  0.96
GT1_DROMA  GLUTATHIONE S-TRANSFERASE 1-1 (EC ( 200)   99  105   63  128.1  1.1
GT1_DROSE  GLUTATHIONE S-TRANSFERASE 1-1 (EC ( 200)   99  105   63  128.1  1.1
GT1_DROTE  GLUTATHIONE S-TRANSFERASE 1-1 (EC ( 200)   99  105   63  128.1  1.1
GT1_DROYA  GLUTATHIONE S-TRANSFERASE 1-1 (EC ( 200)   99  105   63  128.1  1.1
GT1_DROER  GLUTATHIONE S-TRANSFERASE 1-1 (EC ( 200)   98  104   63  126.8  1.3
GTY2_ISSOR GLUTATHIONE S-TRANSFERASE Y-2 (E  ( 190)   94   67   67  122.0  2.4
ARP2_TOBAC AUXIN-INDUCED PROTEIN PGNT35/PCN  ( 223)   93   92   62  119.6  3.2
MOD5_YEAST TRNA ISOPENTENYLTRANSFERASE (EC   ( 427)  100   46   46  119.5  3.3
GTT1_RAT   GLUTATHIONE S-TRANSFERASE 5 (EC 2 ( 239)   93   99   69  119.0  3.5
LIGE_PSEPA BETA-ETHERASE (BETA-ARYL ETHER C  ( 280)   91   78   78  115.3  5.6

BillPearson [guest] says, "Here, you see that there are glutathione transferases with E() values > 2 that are homologous to the query sequence. In fact, there will often be, for very diverse families, related sequences with E() values > 5 or even 10. But, the only way to show that they are in fact related is to redo the search with the most distant sequences as queries and then check to see if you find matches to other members of the family with E() values << 0.01."

Francisco says, "By most distant you mean those that fall almost at the cut-off of statistical significance?"

BillPearson [guest] says, " Yes, or those that have marginal significance. So, if I took the sequence GTT1_RAT in the example above and did another search, I would find lots of glutathione transferases with E() values < 0.0001. But when you compare the sequence that I used (mgstm1) with GTT1_RAT, the E() value is only 3.5."
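
[Editor's sketch, not part of the session.] The confirmation step Bill describes, re-running the search with a marginal hit as the query and checking whether it finds other family members with very small E() values, can be sketched as follows. Everything here is hypothetical: the function name, the stand-in search function, the sequence names, the E() values, and the 0.01 cutoff are made up for illustration and are not results from the session.

```python
def confirm_by_requery(candidate, family, search, e_cutoff=0.01):
    """Return the known family members that the candidate itself finds
    with an expectation value below e_cutoff when used as the query."""
    return sorted(
        hit_id
        for hit_id, e_value in search(candidate)
        if hit_id in family and hit_id != candidate and e_value < e_cutoff
    )


# Stand-in for a real search; the results are toy data only.
def fake_search(query):
    toy_results = {
        "marginal_hit": [("fam1", 1e-6), ("fam2", 3e-5), ("other", 0.8)],
        "chance_hit": [("other", 0.2), ("fam1", 4.0)],
    }
    return toy_results[query]


family = {"fam1", "fam2", "fam3"}
print(confirm_by_requery("marginal_hit", family, fake_search))  # ['fam1', 'fam2']
print(confirm_by_requery("chance_hit", family, fake_search))    # []
```

A marginal hit that pulls out several family members on the second search is supported as a relative; one that finds none stays unproven, in line with the point that lack of significant similarity does not show non-homology.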

Francisco says, "Some other questions to our guest?"
Francisco eyes all
Mandy says, " He may have sore fingers by now from typing at a 1000 words/sec"

GeorgF is wondering whether the idea of having a FASTA-like search w/ input of >1 query sequence has merit. Sort of Multiple Alignment where one sequence is a whole databank.

CarlosH says, "what would you comment about histogram frequencies"
BillPearson [guest] says, "I'm not sure what you mean?"
CarlosH says, "I refer to printout of the histograms from fasta"
CarlosH says, "I mean, why sometimes it appears cut"

BillPearson [guest] says, " With the current version of FASTA, histograms should be less likely to be cut off. The plotting routine now checks first to see the height of the highest bin and tries to scale the histogram."

CarlosH says, "okay"

BillPearson [guest] says, " TO GeorgF - I am not a big fan of profile type searches, but I suspect that a future version of the FASTA package will include a profile SSEARCH, and possibly even a profile FASTA. I think, in general, that conventional pairwise searching is just as effective at finding distant relatives, and perhaps less likely to produce false positives."

Heinz says, "you mean like a prosite type of search?"

BillPearson [guest] says, "A prosite search is much different from a BLASTP/Fasta/Smith-Waterman search because so much information is thrown away. Remember - if proteins are homologous, they have the same fold. To have the same fold, you must be similar throughout the length of the protein (or at least the folding domain), so it is the entire length of the protein, not just the few residues that are absolutely conserved in the family members that we know about, that contains information about homology."

Francisco says, "Only one more question?"
Francisco says, "Ok, if there are no more questions I would like to thank our guest for sharing his time with us today!!"
Francisco smiles
BillPearson [guest] says, "Happy searching."
CarlosH says, "Nice to meet you !!!"
EricM smiles
Francisco waves at Bill
Francisco says, "Our session is finished for today!!"
BillPearson [guest] has disconnected.
Alx waves goodbye.
Heinz says, "thanks!"
GeorgF bows
GeorgF says, "Very nice meeting :-)"
EricM turns GeorgF's recorder off.

End of Recording


Bionet Announcement:

The June 1st BioMOO weekly meeting of Francisco De La Vega's study group featured the guest lecture of Dr. William Pearson from the Department of Biochemistry, University of Virginia. You probably know that Dr. Pearson is the creator of FastA and an expert in fast database search methods. The session was held on June 1, 24:00 GMT, in the BCD-Classroom at BioMOO and was centered on issues regarding FastA and Blast use, and search strategies to find homologous sequences to a query in databanks. Both Bill Pearson's and Francisco's connections to BioMOO were initiated from the Aspen Center for Physics workshop "Patterns in Biological Sequences", currently being held at Aspen, CO, USA (May 29-June 11, 1995).

One of the most interesting issues discussed at the session was the announcement by Bill Pearson of a new version of the FastA program (v. 2.X) that includes score statistics and a final SW alignment. He also discussed new results regarding FastA 2.X and Blastp comparisons using different substitution matrices (these results have just been accepted for publication in Protein Science).

You can find the recorded transcript of the session at the following URL:

http://www.gene.cinvestav.mx/bcd-session4.html
http://www.techfak.uni-bielefeld.de/bcd/Lectures/pearson.html

Enjoy!
