Here is what you will learn in the following sections: Multiple Alignment will be defined, we will discuss the straightforward approach to calculating an "optimal" one, and you will understand that, unlike pairwise alignment, even an intelligent order of computation is not enough to finish the calculation in reasonable time.
[ WhatIs. ] What is a multiple alignment ? The short answer is this -
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--
That's a multiple alignment. 8 fragments from immunoglobulin sequences are displayed together. Their alignment highlights conserved residues (one of the cysteines forming the disulphide bridges, and the tryptophan are notable), conserved regions (in particular, "Q.PG" at the end of the first 4 sequences), and more sophisticated patterns, like the dominance of hydrophobic residues at fragment positions 1 and 3. The alternating hydrophobicity pattern is typical for the surface beta-strand at the beginning of each fragment. Indeed, multiple alignments are helpful for protein structure prediction.
The alignment can also enable us to infer the evolutionary history of the sequences. It looks like the first 4 sequences and the last 4 sequences are derived from 2 different common ancestors, that in turn derived from a "root" ancestor. Indeed, we've got 4 fragments from the so-called variable regions, and 4 fragments from the constant regions of immunoglobulins. (Don't be fooled by the term "variable". The sequences of the variable regions are about as conserved as the sequences of the constant regions, except for their antigen-binding subregions, which are composed of just a few amino acids each, and give the antibody its specificity.) However, it is necessary to inspect longer fragments than shown here, if you want to make phylogenetic observations that are statistically significant. In this chapter, we will not look into phylogenetic analysis, or statistical significance in general; they deserve chapters of their own !
Exercise
[05,opt.] Find out more about the role that hydrophobic and hydrophilic residues
play in protein structures. Also, do you think the G at the end of
all but one fragment is a residue conserved over time ?
Exercises are marked with numbers that are (hopefully !) proportional to their difficulty. Those that are clearly optional are marked by the acronym 'opt'. Also, an appended symbol '*' marks the essential exercises, 'M' marks more mathematical depth (relative to no depth at all :-), 'B' marks more biological depth, 'A' marks assignments you are asked to submit to your instructor, and an appended letter 'P' marks things you can write a program for (preferably in MOO-code, so that you can demonstrate it in our electronic classroom. Contact the author at fuellen@dali.mathematik.uni-bielefeld.de before you start coding ! ).
In case you're wondering about the '[ WhatIs. ]' at the beginning of this section: I've provided acronyms so that you can clearly specify which part of the text you're talking about in the electronic classroom. This is paragraph "WhatIs-6" (the sixth paragraph in the "WhatIs" section), and no matter whether the others studied the hypertext or the postscript version, they will all find it.
[ FormalDef ] Computer scientists and mathematicians prefer a longer, more formal answer to our question "What is a Multiple Alignment ?", as follows:
Let's imagine we've got
sequences,
,
each sequence consisting of characters taken from an alphabet
of letters, denoted
(see chapter 1).
can be { A,C,G,T} for DNA sequences.
Let's say
must be at least 2,
since aligning zero or one sequences just doesn't make sense.
(If you're a good observer, you will note that most of the chapter
doesn't make sense for 2 sequences either. Well, it makes certain
sense by boiling down to useless, yet obviously true observations for the
pairwise case. However, in practice, multiple alignment is never done for less than 3 sequences.)
Next, we need to insist
that
does not contain the special character "-",
which we want to reserve for denoting the gaps.
Then we can write our alignments using the alphabet
, which is
plus the gap character "-".
can be { A,C,G,T,- }
for DNA sequence alignments. We define -
A Multiple Alignment of
sequences
is a rectangular array, consisting of characters taken
from the alphabet
, that satisfies the following 3 conditions:
1. There are exactly
rows.
2. Ignoring the gap character, row number
is exactly the sequence
.
3. Each column contains at least one character different from "-".
If the sequences are written as

then the multiple alignment will be written as

Exercise
[00] Can you imagine what
is, before
reading on ?
Some of the
are gaps, and ignoring them, the rows
become the original sequences
, like sequence No. 8 from the example
alignment:
Before:
V S L T C L V K G F Y P S D - - I A V E W E S N G - -
After:
V S L T C L V K G F Y P S D I A V E W E S N G.
Depending on how many gaps we inserted, the multiple alignment array has got a particular
width, which we have called
.
[ Compare ] Our next step is to compare multiple alignments. Our definition tells us what a multiple alignment is, but not whether the one from the beginning of the chapter,
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--
or the following one,
VTISCTGSSSNIG-AGNHVKWYQQLPG
VTISCTGTSSNIG--SITVNWYQQLPG
LRLSCSSSGFIFS--SYAMYWVRQAPG
LSLTCTVSGTSFD--DYYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNW--YVDG
ATLVCLISDFYPG--AVTVAW--KADS
AALGCLVKDYFPE--PVTVSW--NS-G
VSLTCLVKGFYPS--DIAVEW--ESNG
or, maybe, even this candidate
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCS-SSGFIFSS-YAMYWVRQAPG
LSLTCT-VSGTSFDD-YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--
is the "better" alignment. What is "better", anyway ? We need a cost/weight function, and a really simple way to start is to evaluate the costs/weights column by column, like this:

I have not made up the value "26", but I've used the "unit costs" from chapter 1 about pairwise alignment, summing up all the costs of all possible pairs of letters, i.e. the sum of the unit costs of the pairs (1,2), (1,3), (1,4), ..., (1,8), (2,3), (2,4), ..., (2,8), (3,4), (3,5), ..., ..., ..., and (7,8). In general, any cost/weight scheme could be used, it just needs to map pairs of characters to a numeric value. "Unit costs" is just a very convenient example.
Exercise
[00] Using unit costs, recalculate "26" for
yourself.
Exercise
[10A] Use the 250 PAM similarity matrix to calculate
the cost of the column. To do this,
convert the PAM similarity scores into costs.
For any pair of amino acids, calculate 17 minus the PAM
score of that pair.
("17" is the largest value in the standard PAM similarity
matrix. This is how the MSA software does the conversion, see below.)
For the comparison with "-", e.g. pairs like (A,-),
use a cost of 20.
The 250 PAM matrix is available at
http://www.techfak.uni-bielefeld.de/bcd/Curric/PrwAli/Matrices/pam250.mat.
In this text, our goal is to minimise the total cost, or distance, for the alignment, so smaller is better. In contrast, sometimes people aim to maximise similarity. These people can use the PAM similarity matrix directly in their calculations.
[ ColumnsFirst ]
A column is just a generalisation of an edit operation
as introduced in Chapter 1. Indeed, we can view the
operations "Replace/Match",
"Insert", and "Delete" as pairs of characters, i.e.
,
and
where
and
Therefore, the edit operations can be represented by columns with 2 entries.
The replacement of one character by a different character is also called a mismatch.
Exercise
[00] Not each column with 2 entries represents an edit operation, though.
Which one does not ?
Given a cost/weight schema
mapping pairs of characters
to a numeric value, we can calculate a cost of the overall alignment, by summing up
the column costs:

where
,
is the width of the alignment, and
is the set
of all pairs, where the first index
is smaller than the second index
. It goes without saying that
are larger than zero,
and less than or equal to the number of sequences,
.
Our cost model is a simple example of what is known as a sum-of-pairs cost, abbreviated to SP-cost, and we can think of the pairs as being "projections" of the column, in the same way as objects in a cubic lattice (3 dimensions) can be projected by light sources, forming shadows on the 6 different faces of the cube (2 dimensions). "Sum-of-pairs" is just one way of scoring. We'll touch another one later.
Exercise
[05P, opt.] In the "unit cost" model,
calculate the simple SP-cost of our example alignment from the beginning
of the chapter.
[ PairsFirst ] Equivalently, we can obtain the simple SP-cost of an alignment by calculating all pairwise costs in the same way as we did in chapter 1 (i.e., by looking at the number of replacements/matches, insertions and deletions), and then summing up over all pairs of sequences:
Exercise
[05P, opt.] Again, calculate the SP-cost of our example alignment from the beginning of the
chapter, according to the last equation.
I hope you obtained the same results for the last 2 exercises !
We've just rearranged the order of summation,
so this cost is the same as the one calculated before.
This rearrangement is possible because we've got a cost/weight scheme
defined on pairs of characters, and nothing more.
In most cases, only one of the 2 ways of summation can be used to express a particular cost model, and this makes a standard formulation of multiple alignment a little difficult. Some approaches are best cast "columns first", whereas others can only be cast "pairs first".
Exercise
[10A] Imagine you want to calculate the shortest common supersequence,
i.e. the shortest sequence to which all sequences can be aligned without mismatches.
Which cost/weight scheme can you use ?
Hint: a) Despite its name, the shortest common supersequence cannot be shorter than the longest of the sequences. b) Cast the problem "columns first", and imagine how the cost of a column
should be defined. If you do it right, there will be a simple relation between
the cost of the whole multiple alignment and the length of the
supersequence. (cf. [Kec93].)
Exercise
[05A] Come up with a cost model that must be treated "pairs first".
Hint: In the following alignment, is it fair to treat all 5 gaps as seperate entities ?
G-----S
GNAVSNS
GNANANS
Indeed, we will concentrate on the "pairs first" approach, and in the next section, a pair of rows will be called a projection of the whole multiple alignment, like a pair of letters is a projection of an alignment column.
Now we're ready to talk about "Optimal Multiple Alignments".
An Optimal Multiple Alignment is an alignment with minimum overall cost, or maximum overall
similarity.
We'll denote the cost of the optimal alignment of sequences
by
.
[ Lattice ] Recall the "path through a distance lattice/matrix" concept from Chapter 1. For the case of 3 sequences, every alignment can be cast as a unique path through a 3-dimensional lattice (See Fig. 1 for an example).

Figure 1: Alignment Path for 3 Sequences.
We can denote a path through a lattice in a simple way. For each node visited, we list, component by component, the distance from the starting point in the bottom left (i.e. from the source (0,0,0) of the lattice). For example, the path in Fig. 1 is written down as (0,0,0), (1,0,0), (2,1,0), (3,2,0), (3,3,1), (4,3,2). As you can see, the distance from the starting point is the number of letters already aligned. For example, column 4 of the alignment corresponds to node (3,3,1). Indeed, 3 letters from the first and second sequence are aligned at that point, and one letter from the third.
You can create alignment paths and 3-dimensional lattices, and rotate them in the window of your WWW-browser, using our very own Java-based "Multiple Alignment Visualization Tool", at http://bibiserv.techfak.uni-bielefeld.de/visualign/.
Exercise
[10, opt.] Draw the hyperlattice for the following alignment,
and write down the alignment path.
GN-S
GNA-
-N-S
Now I'd like to invite you to relax and imagine an 8-dimensional hyperlattice... In our example from the beginning of this chapter, we're first walking straight ahead into the lattice, following the main diagonal. Starting again at node (0,0,0,0,0,0,0,0) we move to node (1,1,1,1,1,1,1,1), and so forth, until we've reached the node (14,14,14,14,14,14,14,14). The next three nodes are as follows, (15,14,14,14,15,14,14,14), (15,14,14,14,16,14,14,14), and (16,15,15,15,17,15,15,15).
Exercise
[02*] And the next node is ??
Exercise
[05M, opt.]
Using the convention that for any letter L in A,
= -, and
, can you cast
an alignment column as the combination of a vector of binary numbers
and the letters at the incoming edges of a node in the lattice ?
(cf. [Wat89]).
Figure 2: Projection of the Alignment from the Right-Hand-Side.
In Fig. 1, a three-dimensional lattice is displayed, with sequences starting at the bottom-left end. If you imagine light sources on the top, front, and right-hand side of the lattice, "shadows" of the alignment will be projected to the opposing faces (walls). In Fig. 2, only the light source on the right is "on", projecting the path onto the face on the left. In Fig. 3, all light sources are "on". (The light sources should ideally be much farther away from the lattice, so that the shadows are projected without distortion.)
Figure 3: All 3 Pairwise Projections of the Alignment
[ ProjAndGaps ] We've already dealt with the projections of a column; here the whole alignment path is being projected ! The projections of a multiple alignment to pairwise alignments will play an important role in speeding up the calculation of the optimal one.
The projection of an alignment may be "shorter" than the original one. For example, the alignment of
G---SNS
GN----S
GNAVSNS
projected in the direction of the first 2 sequences, is as follows,
G-SNS
GN--S
Aligned gaps are ignored !
This is exactly what happens if the alignment path
progresses in the direction of the projection;
there is no shadow left ! For example, if the alignment path
in Fig. 3 at first
progresses towards the right of the hyperlattice, in the direction
of the first sequence, and is perpendicular to the other 2 sequences,
the projection by light source L1 is not a line, but a single point.
And indeed, "V" is aligned to 2 gaps.
Exercise
[02*A] Can the projected path of the optimal multiple alignment
ever be less costly than the optimal pairwise alignment ? Explain why, or
why not.
The lattice is the 2-dimensional equivalent of the hyperlattice introduced earlier, and each lattice node can be identified with the corresponding position in the "distance matrix" discussed in chapter 1. Once the calculation is finished, we have assigned a cost to each node. This cost is the minimum cost of aligning the two sequences, here VSN and SNA, up to the point defined by the node.
As in Fig. 1, but in contrast to chapter 1, we've arranged the sequences such that the calculation of the alignment costs starts in the bottom-left corner, and not the top-left corner. In this setting, one way to obey the order of dependencies is the following: Start bottom-left, then move to the right until the bottom row is finished, then visit the node marked by an asteriks (*), move to the right as before, etc.
Figure 4: Looking Back at Previously Visited Nodes.
The arrows in Fig. 4 denote the dependencies, e.g. the calculation for the node labeled "current visit" depends on 3 other nodes; we are looking back at 3 possible edges from which we can reach it. These, in turn, correspond to either a replacement/match (diagonal arrow), or the introduction of a gap in one of the sequences. Which of these 3 "edit operations" gives rise to the minimum overall cost for reaching the current node, and thereby suggests the best path to it, i.e. the best alignment of the sequences up to this point, VS and SN ? We have to look at
Extending this technique to 3 or more dimensions leads to formulas very much like the formulas from chapter 1. We've just got more nodes to look back, e.g. 7 nodes for three sequences. Correspondingly, the minimum needs to be taken from 7 possible values (see Fig. 5).
Figure 5: Looking Back in the Case of 3 Sequences.
Exercise
[10M, opt.] If you've done the last exercise in the "math track", you can
cast the recursive (dynamic programming) formula in a very elegant way, taking the minimum,
over all vectors of binary numbers
, of an expression dependent on
.
(cf. [Wat89]).
Whether we investigate the formulas, or just visualize all the nodes of a, say, 8-dimensional hyperlattice, we can easily see that our calculation is pretty much time- and space-consuming !
In the next subsection, we will formalize "pretty much", and then we'll look at a technique to cut down the number of nodes that need to be visited to calculate the alignment.
Exercise
[10, opt.] Imagine that you want to use a hill-climbing strategy
like Simulated Annealing, or a Genetic Algorithm, to find minimum-cost
alignments. Which obstacle(s) need to be overcome ?
(See Chapter 5 for an introduction to Genetic Algorithms. Cf. [Vin91], p.3).
-dimensional hyperlattice is visited once, and therefore
the running time must be proportional to the number of nodes in the lattice.
Looking at the 3-dimensional lattice we've used for visualisation,
this number is the product of the lengths of the sequences. The important
question remaining is thus: How many steps does the algorithm ``rest'' at
each node ? Dynamic programming organizes the visiting of nodes
in such a way that we just need to ``look back'' one single step,
at the nodes that we've visited before, to look up the values we need
for calculating the minimum. (See Fig. 4 and 5.)
So the time we spend for retrieving the minima
and calculating the sum does not depend on the length of the sequences !
However, it depends on the number of sequences. We've had 3 values (Fig. 4)
for 2 sequences, 7 values (Fig. 5) for 3 sequences, and, you may have
guessed it, 15 values for 4 sequences. This goes up exponentially, just
like the
! Using our reasoning, one can formally prove
that the running time is

where
denotes proportionality.
All this is bad news ! If the proportionality factor is 1 nanosecond, then
for 6 sequences of length 100, we'll have a running time of
, that's roughly 64000 seconds.
Just add 2 sequences, and the running time is
seconds !
Exercise
[02A] What's the running time for 10 sequences of length 200 ?
Even worse, let's have a look at the memory space we're using... If we want to trace back the alignment, we need to store the whole lattice, a datastructure the size of a multidimensional skyscraper. In fact, space is the No.1 problem here, bogging down multiple alignment methods that try to achieve optimality. Furthermore, incorporating a realistic gap model (see chapter 1), we will further increase our demands on space and running time, although there is a reasonable gap cost model that does not involve too many excess computations, see [Alt89].
Some introductory texts on multiple alignment are the following. Chapter 10 from the new book by Michael Waterman [Wat95] gives you a good overview, including some more mathematics. [Wat89] is an older introductory text by the same author. Less mathematical, but still from the viewpoint of Computer Science is the text [Mye91]. Some papers with a nice introductory section are [GKS95], [Gus93], and [Kec93]. Reviews from the biologists' perspective are centered around heuristic methods; see section 3.5 for more hints. Two approaches to Multiple Alignment that are not treated here are the use of Hidden Markov Models ([BCHM94]), [KBM+94], [Edd95]) and Gibbs Sampling ([LAB+93]); both are based on statistical methods. The former is also discussed in [Wat95].
Back to VSNS BioComputing Division Home Page.
VSNS-BCD Copyright 1995, 1996.
Georg Fuellen