Here is what you will learn in the following sections: You will understand how the most popular Multiple Alignment heuristic works, and following an example, you will investigate optimal, heuristic, and structurally verified multiple alignments obtained from WWW servers, recapitulating results from an original paper.
[ AliAlongTree ] For more than approximately 8 sequences of average size and similarity, even employing Carrillo-Lipman bounds may not result in a manageable demand on time and memory space, so that an optimal alignment cannot be obtained. (This is the state-of-the-art in 1996.) In such cases, alignment along a tree can be the alternative of choice.

Figure 13: Phylogenetic Tree.
Imagine that you have obtained a phylogenetic tree for the sequences (Fig. 13). This tree may be the result of morphological studies, or it may be obtained from the sequences themselves by one of the methods described in chapter 4. One popular approach (employed by the Clustal software package, http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align-vsns.html), is the generation of all optimal pairwise alignments, the costs of which form the estimated distances between the sequences. From these distances, a tree can then be obtained.
Exercise
[02] How many pairwise comparisons need to be done for n sequences ?
Alignment along a tree is just this; a tree is used to decide about the sequences that shall be aligned first, because of their close relation. After the first step, more sequences are added by aligning them to the existing alignment; we may also align an alignment to an alignment. Alignment along a tree does not necessarily yield an optimal alignment, even if the tree is "perfect". For example, errors may be made in the very first pairwise alignment and they do not get corrected because information from the other sequences is overshadowed during the later steps.
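To make the phrase "decide about the sequences that shall be aligned first" concrete, here is a minimal sketch that turns a table of pairwise distances into a joining order. It is not Clustal's actual tree-building code: the averaging rule between groups is an assumption chosen for simplicity, and the names `merge_order`, `s1` to `s4` and all distance values are hypothetical.

```python
from itertools import combinations

def merge_order(names, dist):
    """Return the order in which groups of sequences are joined, closest first.

    dist maps frozenset({a, b}) to the distance between sequences a and b.
    """
    clusters = [(name,) for name in names]
    order = []

    def avg(c1, c2):
        # Average distance over all cross-group sequence pairs (an assumption).
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        order.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append(c1 + c2)
    return order

# Hypothetical distances between four sequences.
names = ["s1", "s2", "s3", "s4"]
dist = {
    frozenset(("s1", "s2")): 2, frozenset(("s3", "s4")): 3,
    frozenset(("s1", "s3")): 8, frozenset(("s1", "s4")): 9,
    frozenset(("s2", "s3")): 8, frozenset(("s2", "s4")): 9,
}
order = merge_order(names, dist)
for left, right in order:
    print("align", left, "with", right)
```

Each entry of `order` is one alignment step; an entry whose two sides both hold more than one sequence is a step where an alignment is aligned to an alignment.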
Exercise
[02, opt] For which kind of trees may you need to
align an alignment to an alignment ? Or, alternatively, for which
kind of trees do you not need to bother with this ?
The technique for aligning alignments is to simulate standard pairwise alignment, but use profiles instead of sequences. For each position, a profile holds a list of the relative frequencies (i.e. values between 0 and 1) of the 20 amino acids (and gap), and the cost of matching a position in profile A with a position in B is calculated by multiplying the (mis)match scores, for each pair of amino acids, by the said amino acids' frequencies at these positions, and summing up.
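The rule just described (multiply each pairwise score by the two frequencies, then sum) can be sketched as follows. The scoring function `toy_score` is hypothetical and is NOT PAM250, and scoring any pair that involves a gap as 0 is an assumption made only for this illustration.

```python
def column_score(freq_a, freq_b, score):
    """Score of matching one position of profile A with one position of profile B.

    freq_a and freq_b map a residue (or '-' for gap) to its relative frequency.
    """
    return sum(fa * fb * score(x, y)
               for x, fa in freq_a.items()
               for y, fb in freq_b.items())

def toy_score(x, y):
    # Hypothetical scores, not PAM250: +2 match, -1 mismatch, 0 if a gap is involved.
    if x == "-" or y == "-":
        return 0
    return 2 if x == y else -1

col_a = {"A": 0.5, "C": 0.5}     # hypothetical position in profile A
col_b = {"A": 0.75, "-": 0.25}   # hypothetical position in profile B
s = column_score(col_a, col_b, toy_score)
print(s)
```

Here only two terms are non-zero: A with A contributes 0.5 * 0.75 * 2, and C with A contributes 0.5 * 0.75 * (-1).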
Exercise
[05A]
Calculate the score of matching the following two positions in
profiles A, and B, respectively:

Use the PAM250 similarity matrix, yielding a similarity score.
Exercise
[10M, opt.] Develop the mathematical formula for the alignment of
profiles. If you like, begin with the formula for aligning one sequence to
a profile. To this end, you need to introduce frequency
vectors of length 21, one vector per position of the profile.
Normally, the alignments obtained thus far are fixed; gaps may only be added. Then, we follow the rule "Once a gap, always a gap" [FeD87], also known as "Progressive Alignment".
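A small sketch of what "gaps may only be added" means for an already aligned group: when a later step needs a gap, a whole gap column is inserted into every row, so the rows keep their alignment relative to each other. The function names and the two toy rows are hypothetical.

```python
def insert_gap_column(alignment, pos):
    """Insert a gap at column pos into every row of an existing alignment."""
    return [row[:pos] + "-" + row[pos:] for row in alignment]

def strip_gaps(row):
    """Recover the plain sequence from an aligned row."""
    return row.replace("-", "")

group = ["AC-GT", "ACTGT"]              # hypothetical, already aligned
widened = insert_gap_column(group, 2)   # a later step asks for a gap at column 2
print(widened)
```

The gap that was already present in the first row is still there afterwards; it has only been shifted one column to the right together with everything behind it.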
Our technique is illustrated by Fig. 14, adapted with permission from [Bar95], the original of which is available at http://geoff.biop.ox.ac.uk/papers/rev93_1/Figure5.ps.
Figure 14: Progressive Alignment.
Some methods (e.g. [BaS87]) do an iterative refinement of the alignment after the initial pass; now gaps may move.
The following concepts may easily be confused:
[ ImmIntro ] We will now apply our knowledge about heuristic and optimal alignment methods to a real-life example. The example is more real-life than usual for a textbook; we will deal with a lot of problems you may face in your own investigations, like hard-to-find sequences, inconsistent data, etc. The author hopes that this has got some advantages, too :-)
Our example is taken from the paper "A Strategy for the Rapid Multiple Alignment of Protein Sequences. Confidence Levels from Tertiary Structure Comparisons." by G.J. Barton and M.J.E. Sternberg, J Mol Biol 1987;198:327-337.
We will discuss alignments of the immunoglobulin sequences they are using; fragments of these sequences have already been featured in the introduction.
Exercise
[05*] Get the paper ! J Mol Biol, the Journal
of Molecular Biology, is an absolute "must" for any university
library. Students of the GNA-VSNS Biocomputing Course may receive a copy
from the instructors/organizers, if needed.
Nevertheless, care has been taken to ensure that the following section
is self-contained.
Exercise
[10*] Inform yourself about the molecular biology of immunoglobulins;
light chain, heavy chain, disulphide bridges, constant region, so-called variable region,
and how they fit together. (See also Fig. 15, below.)
[ ImmDescr ] The Barton & Sternberg paper is now 9 years old; it's from the early days of Multiple Alignment ! 9 years can be a long time for sequences, too, as we will find out really soon.
The authors write the following about their selection of sequences; formatting their description was done by the textbook author. "Eight domains were selected (Brookhaven Data Bank codes).
(FABCL);
(FABVL);
(FABCH1);
(FABVH).
(FCCH2);
(FCCH3).
(FB4VL);
(FB4VH)."
The chains from FAB and FC1 make up one of
the identical halves of an antibody;
one light ("
") chain, and one heavy ("
") chain,
the heavy chain consisting of 3, and the light chain consisting
of 1 constant region, see Fig. 15. For more details, please try
Kevin Shreders's Antibody Resource Page, http://www.antibodyresource.com/,
in particular the link to Mike Clark's page featuring Images of Immunoglobulin Molecules.
As an example of a relevant database, you may explore
the Kabat Database of Sequences of Proteins of
Immunological Interest, http://immuno.bme.nwu.edu/.
Figure 15: Schematic Structure of an Antibody (Immunoglobulin).
The FB4 regions are added to the collection in order to have an equal number of variable and constant regions. Let me stress that the "variable" regions get their name from the antigen-binding subregions ("CDRs", complementarity-determining regions), which are composed of just a few amino acids each, and give the antibody its specificity. Most of the variable region of an antibody is about as conserved as the constant regions are !
Exercise
[5, opt.] Using the
Molecules R Us
server, http://molbio.info.nih.gov/cgi-bin/pdb,
get some images of the 1FC1
immunoglobulin.
Exercise
[15, opt.] Using technology from the VSNS-PPS course,
http://www.cryst.bbk.ac.uk/PPS/index.html,
you can take a closer look at 1FC1. (This may take some time, though,
if you need to install software, etc. Right now, the GNA-VSNS Biocomputing Course
organizers have not got enough time resources to help you
intensively.)
In the alignments from the paper and from our introduction, the sequences are arranged as follows:
(FABVL);
(FB4VL);
(FB4VH);
(FABVH).
(FCCH2);
(FABCL);
(FABCH1);
(FCCH3).
The arrangement of the variable and constant sets is done to maximize similarity of adjacent sequences: both FB4 variable regions go together, and both FAB constant regions go together. We will use this numbering (BS1-BS8) throughout.
Up until the beginning of the next subsection (3.4) the following is an optional part of the chapter, in which you will retrieve the sequences from the net, and check your results.
[ ImmRetrieval ]
Exercise
[15, opt.] For this exercise, note that there are quite a lot
of differences between the sequences you retrieve and the
sequences from the paper. What's more, the sequences will
be different depending on the data bank you searched !
But don't despair, you will have a scout with you !
Obtain the 8 immunoglobulin sequences, using what you learned in
chapter 2. If you've not read chapter 2 (What a shame !),
start with
Pedro's list, http://www.public.iastate.edu/~pedro/research_tools.html
and try out the various PDB resources.
Hint: 2 of the entries have been superseded, and once you know the new entry IDs,
you can search via
SRS-WWW, http://www.embl-heidelberg.de/srs/srsc.
If you'd like to obtain sequences
with the one-letter code directly, (and you want to end up with
exactly the same sequences as the author), you can access a nice databank for
this via SRS: PDBFINDER. PDBFINDER however does not distinguish
variable and constant regions; they are just concatenated !
(But you don't need to worry about this.)
Let's take a look at the 3 PDBFINDER files you retrieved:
http://www.embl-heidelberg.de/srs/srsc?[PDBFINDER-id:7FAB]
http://www.embl-heidelberg.de/srs/srsc?[PDBFINDER-id:1FC1]
http://www.embl-heidelberg.de/srs/srsc?[PDBFINDER-id:2FB4]
They're quite regular, 7FAB and 2FB4 listing one heavy and one light chain each, and 1FC1 listing 2 identical chains A and B.
Exercise
[05B*] Why are chains A and B identical ?
Exercise
[05B*] "Why do light and heavy chains suddenly have the same length in 7FAB
and 2FB4 ? I thought, the heavy chain is twice as long ?!"
[ ImmRetrievalVerif ] We will now do a plausibility check on whether we've retrieved the right sequences. To this end, we'll align the fragments from the introduction (they are listed in the order BS1-BS8, taken directly from the paper) to the retrieved sequences. Variable and constant regions are still stuck together !
Exercise
[10] Using the
Clustal Query Form, http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align-vsns.html,
align the fragments with
the chains.
Note that the above query provides a Clustal Interface with the 1995
default parameters, so that your alignments match exactly the ones cited in this
text ! If you use the standard BCM Launcher page, you will get different results.
Your Query, in Fasta-Format, should look like:
>7FAB_light_chain
ASVLTQPPSVSGAPGQRVTISCTGSSSNIGAGHNVKWYQQLPGTAPKLLIFHNNARFSVSKSGTSATLAITGLQAEDEAD
YYCQSYDRSLRVFGGGTKLTVLRQPKAAPSVTLFPPSSEELQANKATLVCLISDFYPGAVTVAWKADGSPVKAGVETTTP
SKQSNNKYAASSYLSLTPEQWKSHKSYSCQVTHEGSTVEKTVAP
>2FB4_light_chain
QSVLTQPPSASGTPGQRVTISCSGTSSNIGSSTVNWYQQLPGMAPKLLIYRDAMRPSGVPDRFSGSKSGASASLAIGGLQ
SEDETDYYCAAWDVSLNAYVFGTGTKVTVLGQPKANPTVTLFPPSSEELQANKATLVCLISDFYPGAVTVAWKADGSPVK
AGVETTKPSKQSNNKYAASSYLSLTPEQWKSHRSYSCQVTHEGSTVEKTVAPTECS
>2FB4_heavy_chain
EVQLVQSGGGVVQPGRSLRLSCSSSGFIFSSYAMYWVRQAPGKGLEWVAIIWDDGSDQHYADSVKGRFTISRNDSKNTLF
LQMDSLRPEDTGVYFCARDGGHGFCSSASCFGPDYWGQGTPVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFP
QPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKRVEPKSC
>7FAB_heavy_chain
AVQLEQSGPGLVRPSQTLSLTCTVSGTSFDDYYWTWVRQPPGRGLEWIGYVFYTGTTLLDPSLRGRVTMLVNTSKNQFSL
RLSSVTAADTAVYYCARNLIAGGIDVWGQGSLVTVSSASTKGPSVFPLAPTAALGCLVKDYFPEPVTVSWNSGALTSGVH
TFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKKVEP
>1FC1
PSVFLFPPKPKDTLMISRTPEVTCVVVDVSHEDPQVKFNWYVDGVQVHNAKTKPREQQYNSTYRVVSVLTVLHQNWLDGK
EYKCKVSNKALPAPIEKTISKAKGQPREPQVYTLPPSREEMTKNQVSLTCLVKGFYPSDIAVEWESNGQPENNYKTTPPV
LDSDGSFFLYSKLTVDKSRWQQGNVFSCSVMHEALHNHYTQKSLSLS
>BS1-fragment
VTISCTGSSSNIGAGNHVKWYQQLPG
>BS2-fragment
VTISCTGTSSNIGSITVNWYQQLPG
>BS3-fragment
LRLSCSSSGFIFSSYAMYWVRQAPG
>BS4-fragment
LSLTCTVSGTSFDDYYSTWVRQPPG
>BS5-fragment
PEVTCVVVDVSHEDPQVKFNWYVDG
>BS6-fragment
ATLVCLISDFYPGAVTVAWKADS
>BS7-fragment
AALGCLVKDYFPEPVTVSWNSG
>BS8-fragment
VSLTCLVKGFYPSDIAVEWESNG
Here is the result you will get:
Page 1.1
1 15 16 30 31 45 46 60 61 75 76 90
1 7FAB_light --------------- --------------- --------------- --------------- --------------- --------------- 0
2 2FB4_light QSVLTQPPSASGTPG QRVTISCSGTSSNIG SSTVNWYQQLPGMAP KLLIYRDAMRPSGVP DRFSGSKSGASASLA IGGLQSEDETDYYCA 90
3 2FB4_heavy --------------- --------------- --------------- --------------- --------------- --------------- 0
4 7FAB_heavy --------------- --------------- --------------- --------------- --------------- --------------- 0
5 1FC1 --------------- --------------- --------------- --------------- --------------- --------------- 0
6 BS1-fragme --------------- --------------- --------------- --------------- --------------- --------------- 0
7 BS2-fragme --------------- --------------- --------------- --------------- --------------- --------------- 0
8 BS3-fragme --------------- --------------- --------------- --------------- --------------- --------------- 0
9 BS4-fragme --------------- --------------- --------------- --------------- --------------- --------------- 0
10 BS5-fragme --------------- --------------- --------------- --------------- --------------- --------------- 0
11 BS6-fragme --------------- --------------- --------------- --------------- --------------- --------------- 0
12 BS7-fragme --------------- --------------- --------------- --------------- --------------- --------------- 0
13 BS8-fragme --------------- --------------- --------------- --------------- --------------- --------------- 0
Page 2.1
91 105 106 120 121 135 136 150 151 165 166 180
1 7FAB_light --------------- -----------ASVL TQPPSVSGAPGQRVT ISCTGSSSNIGAG-H NVKWYQQLPGTAPKL LIFHNNARFSVSKSG 63
2 2FB4_light AWDVSLNAYVFGTGT KVTVLGQPKANPTVT LFPPSSEELQANKAT LVCLISDFYPGA--V TVAWKADGSPVKAGV ETTKPSKQSNNKYAA 178
3 2FB4_heavy --------------- -----------EVQL VQSGGGVVQPGRSLR LSCS-SSGFIFSS-Y AMYWVRQAPGKGLEW VAIIWDDGSDQHYAD 62
4 7FAB_heavy --------------- -----------AVQL EQSGPGLVRPSQTLS LTCT-VSGTSFDD-Y YWTWVRQPPGRGLEW IGYVFYTG------- 55
5 1FC1 --------------- ---------PSVFLF PPKPKDTLMISRTPE VTCVVVDVSHEDPQV KFNWYVDGVQVHNAK TKPREQQYNSTYRVV 66
6 BS1-fragme --------------- --------------- -------------VT ISCTGSSSNIGAG-N HVKWYQQLPG----- --------------- 26
7 BS2-fragme --------------- --------------- -------------VT ISCTGTSSNIGS--I TVNWYQQLPG----- --------------- 25
8 BS3-fragme --------------- --------------- -------------LR LSCS-SSGFIFSS-Y AMYWVRQAPG----- --------------- 25
9 BS4-fragme --------------- --------------- -------------LS LTCT-VSGTSFDD-Y YSTWVRQPPG----- --------------- 25
10 BS5-fragme --------------- --------------- -------------PE VTCVVVDVSHEDPQV KFNWYVDG------- --------------- 25
11 BS6-fragme --------------- --------------- -------------AT LVCLISDFYPGA--V TVAWKADS------- --------------- 23
12 BS7-fragme --------------- --------------- -------------AA LGCL-VKDYFPEP-V TVSWN---SG----- --------------- 22
13 BS8-fragme --------------- --------------- -------------VS LTCLVKGFYPSD--I AVEWESNG------- --------------- 23
(continues alignment of full chains)
Exercise
[00B] Why did we not just use our text editor to find the fragments
in the sequences ?
Exercise
[00] Why did we not use a local alignment technique like the one
presented in chapter 1 ? That would have worked much better !
[ ImmRetrievalVerifDisc ] Let us interpret the Clustal Alignment. First of all, 2FB4 light got shifted; its constant region seems to be very similar to the variable regions of other chains !
Exercise
[02B] How do we know that the PDBFINDER files list the variable region
followed by the constant region, and not vice versa ?
BS1, BS3 and BS4 align as expected to the variable regions of the 7FAB / 2FB4 chains. (There is an HN/NH difference between fragment BS1 and 7FAB at positions 150-151. Also, pos. 152 is inconsistent for BS4.) BS2 is supposed to align with the variable region of 2FB4 light, but it doesn't ! Indeed, taking a look at the tree used by Clustal (Fig. 16), we see that it aligns the profiles containing BS2 and 2FB4 light at a rather late stage, so that BS2's high similarity (not identity, due to whatever errors) with the subsequence VTISCSGTSSNIG SSTVNWYQQLPG in the 2FB4 light variable region (pos. 18-42) has been overshadowed during profile alignment.

Figure 16: Phylogenetic Tree used by Clustal.
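The HN/NH difference between fragment BS1 and 7FAB noted above can also be located without a full multiple alignment. The following is only a sketch of an ungapped sliding-window comparison, not what Clustal does; `best_window` is a hypothetical helper, and the chain string is the first 80 residues of 7FAB light as given in the query above.

```python
def best_window(fragment, chain):
    """Return (start, mismatches) of the ungapped window of chain closest to fragment.

    start is 0-based; ties are resolved in favour of the leftmost window.
    """
    n = len(fragment)
    candidates = []
    for start in range(len(chain) - n + 1):
        window = chain[start:start + n]
        mismatches = sum(a != b for a, b in zip(fragment, window))
        candidates.append((mismatches, start))
    mismatches, start = min(candidates)
    return start, mismatches

# First 80 residues of the 7FAB light chain, and fragment BS1, from the query.
chain_7fab_light = ("ASVLTQPPSVSGAPGQRVTISCTGSSSNIGAGHNVKWYQQLPG"
                    "TAPKLLIFHNNARFSVSKSGTSATLAITGLQAEDEAD")
bs1 = "VTISCTGSSSNIGAGNHVKWYQQLPG"

start, mismatches = best_window(bs1, chain_7fab_light)
print(start, mismatches, bs1 in chain_7fab_light)
```

Because the window is ungapped, this approach will not cope with fragments that need an insertion or deletion relative to the chain.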
Next, observe that BS5 and BS6 are aligned properly to 1FC1 and 2FB4 light, respectively. For BS5 that's OK (1FC1, as we've said in the beginning, is indeed the concatenation of the heavy chain's second and third constant region, and BS5 is a fragment from the second constant region; see also Fig. 15.) For BS6, this is a little miracle; after all, it aligns to the constant region of 2FB4 light, which is not in our collection of 8 immunoglobulin sequences !
Exercise
[02] So which two constant regions have an identical (sub)-sequence ?
If you're not sure, use your text editor, searching this text for even
smaller fragments like "LVCL". Can you find a reason for this identity ?
BS7 and BS8 are obviously misaligned; you will find copies of them at the end of 7FAB heavy, and 1FC1, in the constant regions (these ends are cut away in the Clustal Alignment shown.)
The following 3 exercises are concerned with some problems you encounter when doing databank retrieval.
Exercise
[05, opt.] Find sequence 7FAB light (variable region) in SwissProt and confirm
that it's got NH again, just as in fragment BS1, pos. 150-151.
(I'm not exactly sure about the difference between
"IG LAMBDA CHAIN V-VI REGION (NIG-48)" and
"IG LAMBDA CHAIN V-I REGION (NEWM)". The latter is the correct one.
If you've got the paper handy, you will note that this sequence
is exactly the one from the paper, whereas the one we obtained
from PDBFINDER differs in 4 positions !)
If you want to detect further problems, you can go on and retrieve what
seem to be the SwissProt equivalents of the 2FB4 heavy chain variable region (BS3) and 7FAB heavy chain variable region (BS4). Now they're farther away;
I guess this must have got something to do with the labels "V-III(KOL)" and
"V-II(NEWM)" of the SwissProt sequences. Can an immunologist help out ?
Exercise
[25, opt.] Try to find the 2FB4 light variable region in another databank, i.e. not in PDB/PDBFINDER.
This seems to be a challenge, and I couldn't find it, not even using Blast/Fasta searches.
If you find it, or know why it's not in SwissProt, the author will email you
a beer !
Exercise
[15, opt.] Try to find the 7FAB light constant region in another databank, i.e. not in PDB/PDBFINDER. Another challenge !
If you've done the last few exercises, you have got some justification to cite for your cutting point between the variable and constant regions of our sequences; equivalently you could have searched for the respective constant regions in SwissProt. This would give you the "exact" cutting points between the constant regions, too; BS7, BS5, and BS8 are in one SwissProt file (why ?), and the headers list the exact cut-points ! Or you can be as lazy as the author and use the cut-marks employed in the Barton & Sternberg paper, as follows. (For perfectionists, BS6 and BS7 got a few residues added, and BS8 got one residue deleted. Now they've got the same length as the ones in the original paper.)
>BS1, 7FAB light chain variable region
ASVLTQPPSVSGAPGQRVTISCTGSSSNIGAGHNVKWYQQLPGTAPKLLIFHNNARFSVSKSGTSATLAITGLQAEDEAD
YYCQSYDRSLRVFGGGTKLTVLR
>BS2, 2FB4 light chain variable region
QSVLTQPPSASGTPGQRVTISCSGTSSNIGSSTVNWYQQLPGMAPKLLIYRDAMRPSGVPDRFSGSKSGASASLAIGGLQ
SEDETDYYCAAWDVSLNAYVFGTGTKVTVLGQ
>BS3, 2FB4 heavy chain variable region
EVQLVQSGGGVVQPGRSLRLSCSSSGFIFSSYAMYWVRQAPGKGLEWVAIIWDDGSDQHYADSVKGRFTISRNDSKNTLF
LQMDSLRPEDTGVYFCARDGGHGFCSSASCFGPDYWGQGTPVTVSS
>BS4, 7FAB heavy chain variable region
AVQLEQSGPGLVRPSQTLSLTCTVSGTSFDDYYWTWVRQPPGRGLEWIGYVFYTGTTLLDPSLRGRVTMLVNTSKNQFSL
RLSSVTAADTAVYYCARNLIAGGIDVWGQGSLVTVSS
>BS5, 1FC1 heavy chain constant region
PSVFLFPPKPKDTLMISRTPEVTCVVVDVSHEDPQVKFNWYVDGVQVHNAKTKPREQQYNSTYRVVSVLTVLHQNWLDGK
EYKCKVSNKALPAPIEKTISKAKG
>BS6, 7FAB light chain constant region
QPKAAPSVTLFPPSSEELQANKATLVCLISDFYPGAVTVAWKADGSPVKAGVETTTPSKQSNNKYAASSYLSLTPEQWKS
HKSYSCQVTHEGSTVEKTVAPtscs
>BS7, 7FAB heavy chain constant region
ASTKGPSVFPLAPTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHK
PSNTKVDKKVEPksa
>BS8, 1FC1 heavy chain constant region
QPREPQVYTLPPSREEMTKNQVSLTCLVKGFYPSDIAVEWESNGQPENNYKTTPPVLDSDGSFFLYSKLTVDKSRWQQGN
VFSCSVMHEALHNHYTQKSLSL
Let's start with an MSA alignment, since we cannot do any better than optimal if the underlying cost model is appropriate. However, we cannot get an MSA alignment on the net at the Washington University MSA server, http://ibc.wustl.edu/msa.html; the process gets killed after some time, we're using up too many resources ! All we can get is the heuristic alignment calculated by MSA (see 3.6).
Do not overload the Washington University MSA server by trying the MSA alignment yourself ! This time, only use the form with the "optimal alignment" option set to "off". If you've got MSA 1.0 or the newly released MSA 2.1 at your computer, and nobody's watching, you can try it out. Even with MSA 2.1, the author's workstation couldn't finish the job. I guess a supercomputer is needed ?!
Exercise
[10*] Nevertheless, the author has obtained an alignment
from the MSA server that he believes is "optimal". Very simple trick,
explained a little later ;-)
So, here's the optimal MSA alignment (well, sort-of...).
-----asVLTQPPsvsgapgqrvTISCTGsssnigag-hNVKWYqqlpgtapk--llifhnn----------arf
-----qsVLTQPPsasgtpgqrvTISCSGtssnigs--sTVNWYqqlpgmapk--lliyrda---mrpsgvpdrf
-----evQLVQSGggvvqpgrslRLSCSSsgfifss--yAMYWVrqapgkglewvaiiwddgsdqhyadsvkgrf
-----avQLEQSGpglvrpsqtlSLTCTVsgtsfdd--yYWTWVrqppgrglewigyvfytg-ttlldpslrgrv
-p--SVFLFPpkpkdtlmisrtpEVTCVVvdvshedpqvKFNWYvd--gvqvh--naKTKPR----------eqq
qpkaapSVTLFPpsseelqankaTLVCLIsdfypga--vTVAWKadg-spvka--GVETTtp----------skq
--------astkgpSVFPLAptaALGCLVkdyfpep--vTVSWNs---galts--GVHTFpa----------vlq
qpr-epQVYTLPpsreemtknqvSLTCLVkgfypsd--iAVEWEsn--gqpen--NYKTTpp----------vld

SVSKSgTSAT--LAItglqaedeadYYC--QSYdr--------slr--VFGggtkltvlr-
SGSKSgASAS--LAIgglqsedetdYYC--AAWdv--------slnayVFGtgtkvtvlgq
TISRNdskNTLFLQMdslrpedtgvYFCARDgghgfcssascfgpd--YWGqgtpvtvss-
TMLVNtskNQFSLRLssvtaadtavYYCARNliag--------gid--VWGqgslvtvss-
ynstyrVVSV--LTVlhqnwldgkeYKC--KVSnk--------alp--apIEKtiskakg-
snnkyaASSY--LSLtpeqwkshksYSC--QVThe--------gst----VEKtvaptscs
ssglysLSSV--VTV-pssslgtqtYIC--NVNhk--------psn--tkVDKkvepksa-
sdgsffLYSK--LTVdksrwqqgnvFSC--SVMhe--------alh--nhyTQKslsl---
We can easily recognize the correct alignment of the two cysteine residues and the tryptophan. So this alignment is at least not completely off, i.e. it reproduces some features that a working immunologist easily recognizes. (Capitalized residues are part of the structurally verified alignment, see below.)
Exercise
[05, opt.] Get a colorful visualisation of the alignment, by using the
Weblogo server, http://www.bio.cam.ac.uk/cgi-bin/seqlogo/logo.cgi.
For your convenience, the
Fasta format of our alignment is available, see 3.7.
Although this does not do justice to Tom Schneider's "Sequence Logo" theory, we just note that the large characters in the Weblogo output denote the conserved residues.
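For readers who want a number instead of a letter height, here is a rough sketch of the per-column information content on which sequence logos are based. It is a simplification under stated assumptions: gaps are ignored, upper and lower case are treated alike, and the small-sample correction of the full theory is left out. The function name `column_information` is hypothetical.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=20):
    """Information content (in bits) of one alignment column, gaps ignored."""
    residues = [c.upper() for c in column if c != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    # Shannon entropy of the observed residue frequencies.
    entropy = -sum((k / total) * math.log2(k / total) for k in counts.values())
    return math.log2(alphabet_size) - entropy

conserved = column_information("CCCCCCCC")   # a fully conserved column, e.g. a Cys column
mixed = column_information("AACC")           # hypothetical half/half column
print(round(conserved, 3), round(mixed, 3))
```

A fully conserved column reaches the maximum of log2(20) bits, and every additional residue type in a column lowers the value.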
[ ImmMsaEpsAll ] The "optimal" alignment is pretty much different from the heuristic one calculated by MSA before bogging down (see 3.6). Indeed, the polyhedron that needs to be explored is huge, as you can see from looking at the differences between the projected heuristic and the optimal pairwise alignments. (These differences give rise to the "compensation term" that is used to establish the Carrillo-Lipman bounds that in turn influence the polyhedron. See section 2.1 on the theory of the Carrillo-Lipman Bound). They are called "epsilon" in the following MSA 2.0 printout (this was printed out before the author's computer started the insurmountable task of exploring the polyhedron). "I" and "J" are, of course, the direction of the projection for which the difference is given.
----Estimated epsilons----
I = 1 J = 2 epsilon = 8
I = 1 J = 3 epsilon = 50
I = 1 J = 4 epsilon = 34
I = 1 J = 5 epsilon = 50
I = 1 J = 6 epsilon = 50
I = 1 J = 7 epsilon = 28
I = 1 J = 8 epsilon = 50
I = 2 J = 3 epsilon = 50
I = 2 J = 4 epsilon = 26
I = 2 J = 5 epsilon = 50
I = 2 J = 6 epsilon = 50
I = 2 J = 7 epsilon = 34
I = 2 J = 8 epsilon = 50
I = 3 J = 4 epsilon = 5
I = 3 J = 5 epsilon = 50
I = 3 J = 6 epsilon = 50
I = 3 J = 7 epsilon = 50
I = 3 J = 8 epsilon = 50
I = 4 J = 5 epsilon = 50
I = 4 J = 6 epsilon = 50
I = 4 J = 7 epsilon = 50
I = 4 J = 8 epsilon = 50
I = 5 J = 6 epsilon = 5
I = 5 J = 7 epsilon = 50
I = 5 J = 8 epsilon = 25
I = 6 J = 7 epsilon = 9
I = 6 J = 8 epsilon = 22
I = 7 J = 8 epsilon = 43
Exercise
[05, opt.]
What does epsilon = 5 mean ? 5 units of what ? How has
this been standardized ? (Hint: See the next exercise.)
Epsilon = 50 is a threshold; larger values are simply cut off ! Therefore, it is possible that even if the computer were not bogged down, the full-size polyhedron would not have been explored, and the alignment would not necessarily have been optimal. The reason for all our trouble is now becoming clear: Our immunoglobulin sequences are too dissimilar to even suggest a heuristic alignment that is indeed close to the optimal one; "expert knowledge" at least about the Cys and Trp (W) residues is needed. We will soon see that the "optimal" MSA alignment (which the author obtained by cutting all sequences in two parts, and piecing the alignments together :-) is approximately as far away from the "biological truth" as MSA's heuristic one, and Clustal's (see below).
[ ImmMsaEpsPartial ] Although the MSA server does not inform you about the "epsilon"- values if the process gets killed, you can still get an idea of these values yourself, by submitting subsets. For example, aligning the constant regions only, the following information is returned (the alignment will be displayed and discussed below.)
Costfile: pam250
Alignment cost: 13103
Lower bound: 12933
Delta: 170
Max. Delta: 199

Sequences  Proj. Cost  Pair. Cost  Epsilon  Max. Epsi.  Weight  Weight*Cost
  1   2       1672        1670        2        19         1        1672
  1   3       1624        1622        2         5         1        1624
  1   4       1656        1656        0         8         2        3312
  2   3       1633        1586       47        41         2        3266
  2   4       1608        1592       16        27         1        1608
  3   4       1621        1565       56        50         1        1621

Elapsed time = 1.469
The quantity called epsilon in the last table is now called Max. Epsi. !
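Before interpreting the printout, it is worth checking that its columns are arithmetically consistent with each other. The sketch below only re-enters the six rows of the constant-region printout and tests two relations that can be read off the numbers; the tuple layout and variable names are hypothetical, and no claim is made here about how MSA computes these values internally.

```python
# (i, j, proj_cost, pair_cost, epsilon, max_epsi, weight, weight_times_cost)
rows = [
    (1, 2, 1672, 1670,  2, 19, 1, 1672),
    (1, 3, 1624, 1622,  2,  5, 1, 1624),
    (1, 4, 1656, 1656,  0,  8, 2, 3312),
    (2, 3, 1633, 1586, 47, 41, 2, 3266),
    (2, 4, 1608, 1592, 16, 27, 1, 1608),
    (3, 4, 1621, 1565, 56, 50, 1, 1621),
]

# In every row, the Epsilon column equals Proj. Cost minus Pair. Cost.
eps_ok = all(eps == proj - pair for _, _, proj, pair, eps, _, _, _ in rows)

# In every row, Weight*Cost equals Weight times Proj. Cost.
weighted_ok = all(wc == w * proj for _, _, proj, _, _, _, w, wc in rows)

# The Weight*Cost column adds up to the reported alignment cost.
total_cost = sum(wc for *_, wc in rows)

print(eps_ok, weighted_ok, total_cost)
```

The remaining header values (Lower bound, Delta, Max. Delta) can be checked against the other columns in the same way.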
Exercise
[05A]
Given the information above, it's now easier to answer the question
``What does epsilon = 5 mean ?'' (see the last exercise).
Exercise
[05*A]
Which quantities do Lower bound, Proj. Cost, Pair. Cost, Delta and Max. Delta represent ?
Hint: The Delta's have to do with the epsilons, just looking
at their size.
Exercise
[10A]
Calculate the Carrillo-Lipman bound for pair (1,2), under the assumption
that the difference between the costs of the projected heuristic and the
pairwise optimal alignment for pair (3,4) is indeed 50 (i.e. that no
cutting down to 50 took place). Pairs (1,4) and (2,3) have weight 2 !
[ ImmStructVerif ]
Looking at the 3-dimensional structures of our protein domains, experts
have derived so-called "structurally verified alignments" for parts of them
(called "motifs" hereafter).
The following are listed in the Barton & Sternberg paper; they correspond to
the different
-chains of the immunoglobulin domains, and will be
taken as the "standard of truth".
A      B      C     D     E       F      G
VLTQPP TISCTG NVKWY SVSKS TSATLAI YYCQSY VFG
VLTQPP TISCSG TVNWY SGSKS ASASLAI YYCAAW VFG
QLVQSG RLSCSS AMYWV TISRN NTLFLQM YFCARD YWG
QLEQSG SLTCTV YWTWV TMLVN NQFSLRL YYCARN VWG
SVFLFP EVTCVV KFNWY KTKPR VVSVLTV YKCKVS IEK
SVTLFP TLVCLI TVAWK GVETT ASSYLSL YSCQVT VEK
SVFPLA ALGCLV TVSWN GVHTF LSSVVTV YICNVN VDK
QVYTLP SLTCLV AVEWE NYKTT LYSKLTV FSCSVM TQK
Exercise
[10*] Looking at the "optimal" MSA
alignment, which
-chains were aligned correctly ? How many residues
were misaligned ? For the latter, count residues as misaligned
if they don't align with the majority of residues that are following the
column of a motif, and count all residues if the column got completely
scrambled (i.e. if there are no two residues that are aligned according to
the motif). In other words, for all of the 38 columns displayed above,
look whether you can at least identify a relative majority of residues aligned
in the same way, and count those residues that are not aligned to them.
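The counting rule of this exercise, applied to a single motif column, can be sketched as follows. The input is assumed to be, for each sequence, the index of the alignment column in which its residue of that motif column ended up; how you extract these indices from an alignment is left open, and the function name and the example indices are hypothetical.

```python
from collections import Counter

def misaligned_in_column(placements):
    """Count misaligned residues for one motif column.

    placements[i] is the alignment column that holds sequence i's residue.
    Residues outside the relative majority count as misaligned; if no two
    residues share a column, all of them count.
    """
    counts = Counter(placements)
    _, majority = counts.most_common(1)[0]
    if majority < 2:
        return len(placements)      # completely scrambled column
    return len(placements) - majority

print(misaligned_in_column([30, 30, 30, 31, 30, 30, 30, 30]))  # one residue is off
print(misaligned_in_column([1, 2, 3, 4, 5, 6, 7, 8]))          # scrambled
```

Summing this count over all 38 motif columns gives the total number of misaligned residues for an alignment.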
Exercise
[05] Take a look at MSA's heuristic alignment
(see 3.6), and/or its Weblogo diagram.
Compared to the "standard-of-truth" data,
which seemingly conserved residue is just an artifact,
i.e. the result of misalignments ?
[ ImmClustal ] Here is the Clustal alignment (again using the BCM Search Launcher, http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align-vsns.html, 1995 default settings), for comparison. The motifs are given in capital letters, but it's nevertheless a good idea to print out the alignment, and put the beta-sheets into boxes in the same way as in the Barton & Sternberg paper (p.331).
-----asVLTQPPsv--sgapgqrvTISCTGsssnigag-hNVKWYqqlpg--tapkllifhnnar---------
-----qsVLTQPPsa--sgtpgqrvTISCSGtssnigs--sTVNWYqqlpg--mapklliyrdamrpsgvpdr--
-----evQLVQSGgg--vvqpgrslRLSCS-Ssgfifss-yAMYWVrqapgkglewvaiiwddgsdqhyadsvkg
-----avQLEQSGpg--lvrpsqtlSLTCT-Vsgtsfdd-yYWTWVrqppgrglewigyvfytg-ttlldpslrg
-----pSVFLFPpkpkdtlmisrtpEVTCVVvdvshedpqvKFNWYvd--g----vqvhnaKTKPReqq------
qpkaapSVTLFPpss--eelqankaTLVCL-Isdfypga-vTVAWKad--g---spvkaGVETTtpsk-------
-------astkgpS---VFPLAptaALGCL-Vkdyfpep-vTVSWNsg-------altsGVHTFpavlq------
qpre-pQVYTLPpsr--eemtknqvSLTCL-Vkgfypsd-iAVEWEsn--g----qpenNYKTTppvld------

-fSVSK--SgTSATLAItglqaedeadYYCQ--------SYdrslr--VFGggtkltvlr-
-fSGSK--SgASASLAIgglqsedetdYYCA--------AWdvslnayVFGtgtkvtvlgq
rfTISRNdskNTLFLQMdslrpedtgvYFCARDgghgfcssascfgpdYWGqgtpvtvss-
rvTMLVNtskNQFSLRLssvtaadtavYYCARN--------liaggidVWGqgslvtvss-
--ynst--yrVVSVLTVlhqnwldgkeYKCK----------VSnkalpapIEKtiskakg-
-qsnnk--yaASSYLSLtpeqwkshksYSCQ----------VThegstVEKtvaptscs--
---ssg--lysLSSVVTVpssslgtqtYICN----------VNhkpsntkVDKkvepksa-
--sdgs--ffLYSKLTVdksrwqqgnvFSCS----------VMhea--lhnhyTQKslsl-
Exercise
[05, opt.] Obtain the Clustal alignment from the WWW.
Exercise
[05A] In the Clustal alignment, which motifs were aligned correctly ?
How many residues were misaligned ? (See Exercise 53.)
Exercise
[10, opt.]
Find out about the tree along which Clustal did the alignment.
(Unfortunately, the WWW Forms I know
do not return a picture of the tree along which Clustal aligned;
you need to interpret or convert the text description of the tree
returned by the Washington University server,
unless you have Clustal/Phylip on your computer.
Alternatively, you may look at the tree from ETH Zurich's
All-All service, http://cbrg.inf.ethz.ch/subsection3_1_1.html,
i.e. Fig. 13. Topologically, it's the same as the Clustal
tree.)
[ ImmMsaPartial ] Let us now align variable and constant regions alone;
asVLTQPPsvsgapgqrvTISCTGsssnigaghNVKWYqqlpgtapkll--ifhnn----------arfSVSKSg
qsVLTQPPsasgtpgqrvTISCSGtssnig-ssTVNWYqqlpgmapkll--iyrda---mrpsgvpdrfSGSKSg
evQLVQSGggvvqpgrslRLSCSSsgfifs-syAMYWVrqapgkglewvaiiwddgsdqhyadsvkgrfTISRNd
avQLEQSGpglvrpsqtlSLTCTVsgtsfd-dyYWTWVrqppgrglewigyvfytg-ttlldpslrgrvTMLVNt

TSAT--LAItglqaedeadYYCQS--------Ydrslr--VFGggtkltvlr-
ASAS--LAIgglqsedetdYYCAA--------WdvslnayVFGtgtkvtvlgq
skNTLFLQMdslrpedtgvYFCARDgghgfcssascfgpdYWGqgtpvtvss-
skNQFSLRLssvtaadtavYYCAR--------NliaggidVWGqgslvtvss-
is the optimal MSA alignment of the variable regions, and
-----pSVFLFPpkpkdtlmisrtpEVTCVVvdvshedpqvKFNWYvdgvqv-hnaKTKPReqqynstyrVVSVL
qpkaapSVTLFPpssee--lqankaTLVCLIsdfypga--vTVAWKadgspvkaGVETTtpskqsnnkyaASSYL
----------astkgpSVFPLAptaALGCLVkdyfpep--vTVSWNsgalt--sGVHTFpavlqssglysLSSVV
qpre-pQVYTLPpsree--mtknqvSLTCLVkgfypsd--iAVEWEsngqpe-nNYKTTppvldsdgsffLYSKL

TVlhqnwldgkeYKCKVSnkalpapIEKtiskakg-
SLtpeqwkshksYSCQVTheg--stVEKtvaptscs
TV-pssslgtqtYICNVNhkpsntkVDKkvepksa-
TVdksrwqqgnvFSCSVMhealhnhyTQKslsl---
is the optimal MSA alignment of the constant regions. Calculating the accuracy for these 2 alignments separately, we count 9 misaligned residues in the variable regions, and 14 in the constant regions, from a total of 152 each. Modifying the accuracy scores of the 8-sequence-alignments (see exercise 53), counting only misaligned amino acids within one group (either constant, or variable), we obtain 14 and 20 errors, respectively. These alignments were made with the help of the other group of sequences, and in fact multiple alignment deteriorates accuracy scores ! Barton & Sternberg perform a more detailed analysis, comparing the scores of all pairwise alignments within one group (without taking the other sequences into consideration) with the accuracy obtained from the 8-sequence alignment, and observe the same deterioration.
[ ImmClustalPartial ] The same phenomenon can be observed using Clustal alignments, viz.
asVLTQPPsvsgapgqrvTISCTGsssnigaghNVKWYqqlpg--tapkllifhnnar----------fSVSK--
qsVLTQPPsasgtpgqrvTISCSGtssnigs-sTVNWYqqlpg--mapklliyrdamrpsgvpdr---fSGSK--
evQLVQSGggvvqpgrslRLSCS-SsgfifssyAMYWVrqapgkglewvaiiwddgsdqhyadsvkgrfTISRNd
avQLEQSGpglvrpsqtlSLTCT-VsgtsfddyYWTWVrqppgrglewigyvfytg-ttlldpslrgrvTMLVNt

SgTSATLAItglqaedeadYYCQSY--------drslr--VFGggtkltvlr-
SgASASLAIgglqsedetdYYCAAW--------dvslnayVFGtgtkvtvlgq
skNTLFLQMdslrpedtgvYFCARDgghgfcssascfgpdYWGqgtpvtvss-
skNQFSLRLssvtaadtavYYCARN--------liaggidVWGqgslvtvss-
is the Clustal alignment of the variable regions, and
-----pSVFLFPpkpkdtlmisrtpEVTCVVvdvshedpqvKFNWYvdgvqvhn-aKTKPReqqynstyrVVSVL
qpkaapSVTLFPpsse--elqankaTLVCLIsdfypg--avTVAWKadgspvkaGVETTtpskqsnnkyaASSYL
astkgpSVFPLApt----------aALGCLVkdyfpe--pvTVSWN-sgaltsG-VHTFpavlqssglysLSSVV
qpre-pQVYTLPpsre--emtknqvSLTCLVkgfyps--diAVEWEsngqpenN-YKTTppvldsdgsffLYSKL

TVlhqnwldgkeYKCKVSnkalpapIEKtiskakg-
SLtpeqwkshksYSCQVTheg--stVEKtvaptscs
TVpssslgt-qtYICNVNhkpsntkVDKkvepksa-
TVdksrwqqgnvFSCSVMhea--lhnhyTQKslsl-
is the Clustal alignment of the constant regions. Calculating accuracy for this case, we observe only 4 misaligned residues in the variable regions, and only 9 misaligned residues in the constant regions.
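The error counts quoted so far are easier to compare as percentages. The following is only a small sketch: it assumes that accuracy is simply the share of the 152 scored residues per group that are not misaligned, and that the 8-sequence counts of 14 and 20 refer to the variable and constant regions in that order. The function name `accuracy_percent` is made up for illustration.

```python
# Misaligned-residue counts quoted in the text; each group has 152 scored residues.
TOTAL_PER_GROUP = 152

# (variable regions, constant regions)
# ASSUMPTION: the "14 and 20" of the 8-sequence alignment are read in this order.
misaligned = {
    "MSA, groups aligned separately": (9, 14),
    "MSA, all 8 sequences together": (14, 20),
    "Clustal, groups aligned separately": (4, 9),
}


def accuracy_percent(errors, total=TOTAL_PER_GROUP):
    """Assumed definition: percentage of scored residues that are not misaligned."""
    return 100.0 * (total - errors) / total


for method, (variable, constant) in misaligned.items():
    combined = accuracy_percent(variable + constant, 2 * TOTAL_PER_GROUP)
    print(f"{method:36s} variable {accuracy_percent(variable):5.1f} %  "
          f"constant {accuracy_percent(constant):5.1f} %  "
          f"combined {combined:5.1f} %")
```

The combined column does not depend on the assumed order of the two 8-sequence counts.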
[ ImmQualityAndRelatedness ] If multiple alignment gives us worse results, why bother with it? As the previous examples show, distant sequences can have a detrimental influence on the alignment of more closely related sequences. Conversely, we are hopeful that adding related sequences can improve the alignment of distant sequences.
Indeed, the Clustal alignment of BS3 and BS8 is as follows,
evQLVQSGggvvqpgr------slRLSCSSsgfifssyAMYWVr-qapgkglewvaiiwd-dgsdqhyadsvkgr
--qprepQVYTLPpsreemtknqvSLTCLVkgfypsdiAVEWEsngqpenNYKTTppvldsdgs----------f

fTISRNdskNTLFLQMdslrpedtgvYFCARDgghgfcssascfgpdYWGqgtpvtvss
fLYSK--------LTVdksr---------wqqgnvFSCSVMhealhnhyTQKslsl---
It contains 24 misaligned residues (out of 38), and it's obvious that adding related sequences here improves the alignment significantly. Using their own alignment method, Barton & Sternberg perform all pairwise alignments of one variable region against one constant region, and note that adding the remaining 6 sequences and aligning all 8 together improves accuracy from 41 to 63 percent, on average.
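Counting misaligned residues by hand is tedious, so here is a minimal sketch of one plausible way to automate it for a pair of sequences. It assumes that a residue pair counts as misaligned when the reference alignment puts the two residues in the same column but the tested alignment does not, and that upper-case letters in the reference mark the positions to be scored. This is not necessarily the exact procedure of Barton & Sternberg; the function names are hypothetical.

```python
def aligned_pairs(row_a, row_b, core_only=False):
    """Return the set of residue index pairs (i, j) that share a column.

    Indices count residues only, so gaps ('-') do not advance them.
    With core_only=True, a pair is kept only if both letters are upper case.
    """
    pairs = set()
    i = j = 0
    for a, b in zip(row_a, row_b):
        a_is_residue = a != "-"
        b_is_residue = b != "-"
        if a_is_residue and b_is_residue:
            if not core_only or (a.isupper() and b.isupper()):
                pairs.add((i, j))
        i += int(a_is_residue)
        j += int(b_is_residue)
    return pairs


def count_misaligned(ref_a, ref_b, test_a, test_b, core_only=False):
    """Return (misaligned, scored): reference pairs missing from the test alignment."""
    reference = aligned_pairs(ref_a, ref_b, core_only)
    produced = aligned_pairs(test_a, test_b)
    return len(reference - produced), len(reference)


# Toy example, not taken from the immunoglobulin data:
errors, scored = count_misaligned("AC-GT", "ACTGT", "ACG-T", "ACTGT")
print(f"{errors} of {scored} reference pairs are misaligned")
```

Both alignments passed to `count_misaligned` must contain the same two sequences once the gaps are removed, otherwise the residue indices are not comparable.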
Exercise
[15, opt.] Use Geoffrey Barton's
AMAS utility, http://geoff.biop.ox.ac.uk/servers/amas_server.html,
to analyse the multiple alignments
from this section. Start with the "optimal" MSA alignment we pieced
together. AMAS will give you an idea of the physical
properties that are conserved at various positions. Can you find
residues with hydrophobic properties at
separated by unconserved or hydrophilic residues at
?
Such a pattern is typical for a surface
strand.
AMAS currently accepts FASTA format, provided that you add the
character "*" to the end of each sequence, like this:
>BS1, 7FAB light chain variable region
-----ASVLTQPPSVSGAPGQRVTISCTGSSSNIGAG-HNVKWYQQLPGTAPK--LLIFHNN----------ARF
SVSKSGTSAT--LAITGLQAEDEADYYC--QSYDR--------SLR--VFGGGTKLTVLR-*
>BS2, 2FB4 light chain variable region
-----QSVLTQPPSASGTPGQRVTISCSGTSSNIGS--STVNWYQQLPGMAPK--LLIYRDA---MRPSGVPDRF
SGSKSGASAS--LAIGGLQSEDETDYYC--AAWDV--------SLNAYVFGTGTKVTVLGQ*
[...]
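If you do not want to add the "*" characters by hand, a few lines of script will do. The sketch below assumes an ordinary FASTA text in which every record starts with a ">" header line; the function name `add_terminators` is made up for illustration, and whether AMAS has further format requirements is not checked here.

```python
def add_terminators(fasta_text, terminator="*"):
    """Append `terminator` to the last sequence line of every FASTA record."""
    out = []

    def close_record():
        # Only sequence lines get the terminator, and only once.
        if out and not out[-1].startswith(">") and not out[-1].endswith(terminator):
            out[-1] += terminator

    for line in fasta_text.splitlines():
        line = line.rstrip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            close_record()  # finish the previous record before the next header
        out.append(line)
    close_record()  # finish the final record
    return "\n".join(out) + "\n" if out else ""


example = ">seq1, toy record\nAC-GT\nTTA\n>seq2, toy record\nGG--C\n"
print(add_terminators(example), end="")
```

Records that already end in "*" are left unchanged, so the function can be applied to a file twice without harm.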
Review papers with an emphasis on heuristic multiple alignment are [CWC92] and [MVF94], the latter comparing the results of various implementations on 4 standard datasets. ClustalW is described in [THG94]. For MSA references, see the theory part of this chapter. A general survey on the sequence analysis of immunoglobulins is given in [Wil87]. Some papers dealing with the alignment of immunological sequences are [Tay86], [BaS87] (of course!), and [ViA91].
Here's the heuristic alignment that is calculated by the MSA preprocessing step; the author is still looking for exact documentation of how it is computed. Of the MSA 1.0 implementation, [LAK89] write that they use "a progressive alignment strategy similar to those described by Waterman and Perlwitz [WaP84], Feng and Doolittle [FeD87] and Taylor [Tay87]". "Progressive alignment" obviously refers to the "Once a gap, always a gap" rule mentioned above. However, the MSA 2.0 paper [GKS95] offers a different description.
ASVLTQPPSVSGAPG--------QRVTISCTGSSSNIGAGHNV--KWYQQLPGTAPK---LLIFHNN--------
QSVLTQPPSASGTPG--------QRVTISCSGTSSNIGSS-TV--NWYQQLPGMAPK---LLIYRDAM--RPSGV
EVQLVQSGGGVVQPG--------RSLRLSCSSSGFIFSSY-AM--YWVRQAPGKGLEWVAIIWDDGSDQHYADSV
AVQLEQSGPGLVRPS--------QTLSLTCTVSGTSFDDY-YW--TWVRQPPGRGLEWIGYVFYTGTT-LLDPSL
------PSVFLFPPKPKDTLMISRTPEVTCVVVDVSHEDP-QVKFNWYVDGVQVHNA----KTKPREQ-------
-QPKAAPSVTLFPPSSEE--LQANKATLVCLISDFYPGAV-TV--AWKADGSPVKAG---VETTTPSK-------
-ASTKGPSVFPLAPT----------AALGCLVKDYFPEPV-TV--SW--NSGALTSG---VHTFPAVL-------
--QPREPQVYTLPPSREE--MTKNQVSLTCLVKGFYPSDI-AV--EW-ESNGQPENN---YKTTPPVL-------

-ARFS--VSKSGTSATLAITGLQAEDEADYYCQSYDRSL--------R--VFGGGTKLTVLR--
PDRFS--GSKSGASASLAIGGLQSEDETDYYCAAWDVSL--------NAYVFGTGTKVTVLGQ-
KGRFTISRNDSKNTLFLQMDSLRPEDTGVYFCARDGGHGFCSSASCFGPDYWGQGTPVTVSS--
RGRVTMLVNTSKNQFSLRLSSVTAADTAVYYCARNLIAG--------GIDVWGQGSLVTVSS--
-QYNS--TYRVVSVLTVLHQNWLDGK--EYKCKVSNKAL--------P---APIEKTISKAKG-
-QSNN--KYAASSYLSLTPEQWKSHK--SYSCQVTHEG-------------STVEKTVAPTSCS
-QSSG--LYSLSSVVTVPSSSLGTQ---TYICNVNHKPS--------N---TKVDKKVEPKSA-
-DSDG--SFFLYSKLTVDKSRWQQGN--VFSCSVMHEAL--------H---NHYTQKSLSL---
>BS1, 7FAB light chain variable region
-----ASVLTQPPSVSGAPGQRVTISCTGSSSNIGAG-HNVKWYQQLPGTAPK--LLIFHNN----------ARF
SVSKSGTSAT--LAITGLQAEDEADYYC--QSYDR--------SLR--VFGGGTKLTVLR-
>BS2, 2FB4 light chain variable region
-----QSVLTQPPSASGTPGQRVTISCSGTSSNIGS--STVNWYQQLPGMAPK--LLIYRDA---MRPSGVPDRF
SGSKSGASAS--LAIGGLQSEDETDYYC--AAWDV--------SLNAYVFGTGTKVTVLGQ
>BS3, 2FB4 heavy chain variable region
-----EVQLVQSGGGVVQPGRSLRLSCSSSGFIFSS--YAMYWVRQAPGKGLEWVAIIWDDGSDQHYADSVKGRF
TISRNDSKNTLFLQMDSLRPEDTGVYFCARDGGHGFCSSASCFGPD--YWGQGTPVTVSS-
>BS4, 7FAB heavy chain variable region
-----AVQLEQSGPGLVRPSQTLSLTCTVSGTSFDD--YYWTWVRQPPGRGLEWIGYVFYTG-TTLLDPSLRGRV
TMLVNTSKNQFSLRLSSVTAADTAVYYCARNLIAG--------GID--VWGQGSLVTVSS-
>BS5, 1FC1 heavy chain constant region
-P--SVFLFPPKPKDTLMISRTPEVTCVVVDVSHEDPQVKFNWYVD--GVQVH--NAKTKPR----------EQQ
YNSTYRVVSV--LTVLHQNWLDGKEYKC--KVSNK--------ALP--APIEKTISKAKG-
>BS6, 7FAB light chain constant region
QPKAAPSVTLFPPSSEELQANKATLVCLISDFYPGA--VTVAWKADG-SPVKA--GVETTTP----------SKQ
SNNKYAASSY--LSLTPEQWKSHKSYSC--QVTHE--------GST----VEKTVAPTSCS
>BS7, 7FAB heavy chain constant region
--------ASTKGPSVFPLAPTAALGCLVKDYFPEP--VTVSWNSG--GALTS--GVHTFPA----------VLQ
SSGLYSLSSV--VTV-PSSSLGTQTYIC--NVNHK--------PSN--TKVDKKVEPKSA-
>BS8, 1FC1 heavy chain constant region
QPR-EPQVYTLPPSREEMTKNQVSLTCLVKGFYPSD--IAVEWESN--GQPEN--NYKTTPP----------VLD
SDGSFFLYSK--LTVDKSRWQQGNVFSC--SVMHE--------ALH--NHYTQKSLSL---
This work was supported by the Association for the Promotion of Science and Humanities in Germany (Stifterverband für die Deutsche Wissenschaft). Peter Serocka from the Visualization Laboratory of the Research Center for Studies on Structure Formation has been very helpful with the preparation of several figures. Mandy Caird of the University of Colorado, and Chris Kiesewetter have offered invaluable technical assistance. I'd like to thank Hershel Safer, Rolf Engstrand, Gerard Pujadas, Robert Giegerich, Geoff Barton, Rebecca Parsons, Andrea Schafferhans, Peter Hjelmstrom, Jotun Hein, Jürgen Frey, Wolfram Altenhofen, Fredj Tekaia, Christian Büschking [more to follow] for valuable comments on the manuscript.
VSNS-BCD Copyright 1995, 1996.
Georg Fuellen