Evolutionary Computation is, like neural networks, an example par excellence for an information processing paradigm that was originally developed and exhibited by nature and later discovered by man who subsequently transformed the general principle into computational algorithms to be put to work on computers. Nature makes in an impressive way use of the principle of genetic heritage and evolution. Application of the simple concept of performance based reproduction of individuals („survival of the fittest“) led to the rise of well adapted organisms that can endure in a potentially adverse environment. Mutually beneficial interdependencies, co-operation and even apparently altruistic behaviour can emerge solely by evolution. The investigation of those phenomena is part of research in artificial life but cannot be dealt with in this book.
Evolutionary computation comprises the four main areas of Genetic Algorithms [1], Evolution Strategies [2], Genetic Programming [3]and Simulated Annealing [4]. Genetic algorithms and evolution strategies emerged at about the same time in the United States of America and Germany. Both techniques model the natural evolution process in order to optimise either a fitness function (evolution strategies) or the effort of generating subsequent, well-adapted individuals in successive generations (genetic algorithms). Evolution strategies in their original form were basically stochastic hill-climbing algorithms and used for optimisation of complex, multi-parameter objective functions that in practise cannot be treated analytically. Genetic algorithms in their original form were not primarily designed for function optimisation but to demonstrate the efficiency of genetic crossover in assembling successful candidates over complicated search spaces. Genetic programming takes the idea of solving an optimisation problem by evolution of potential candidates one step further in that not only the parameters of a problem but also the structure of a solution is subject to evolutionary change. Simulated Annealing is mathematically similar to evolution strategies. It was originally derived from a physical model of crystallisation. Only two individuals compete for the highest rank according to a fitness function and the decision about accepting suboptimal candidates is controlled stochastically.
All methods presented in this chapter are heuristic, i.e. they contain a random component. As a consequence (and in contrast to deterministic methods) it can never be guaranteed that the algorithm will find an optimal solution or even any solution at all. Evolutionary algorithms are therefore used preferably for applications were deterministic or analytic methods fail, for example, because the underlying mathematical model is not well defined or the search space is too large for systematic, complete search (n.-p. completeness). Another application area for evolutionary algorithms that is rapidly growing is the simulation of living systems starting with single cells and proceeding to organisms, societies or even whole economic systems [5], [6]. The goal of artificial life is not primarily to model biological life as accurately as possible but to investigate how our life or other, presumably different forms of life could have emerged from non-living components.
Work with evolutionary algorithms bears the potential for a philosophically and epistemologically interesting recursion. At the beginning, evolution emerged spontaneously in nature. Next, man discovers the principle of evolution and acquires knowledge on its mathematical properties. He defines genetic algorithms for computers. To complete the recursive cycle, computational genetic algorithms are applied to the very objects (DNA, proteins) of which they had been derived in the beginning. A practical example of such a meta-recursive application will be given below in the sections on protein folding. Figure 1 illustrates this interplay of natural and simulated evolution.

The so-called genetic algorithm [7] is a heuristic method that operates on pieces of information like nature does on genes in the course of evolution. Individuals are represented by a linear string of letters of an alphabet (in nature nucleotides, in genetic algorithms bits, characters, strings, numbers or other data structures) and they are allowed to mutate, crossover and reproduce. All individuals of one generation are evaluated by a fitness function. Depending on the generation replacement mode a subset of parents and offspring enters the next reproduction cycle. After a number of iterations the population consists of individuals that are well adapted in terms of the fitness function. Although this setting is reminiscent of a classical function optimisation problem genetic algorithms were originally designed to demonstrate the benefit of genetic crossover in an evolutionary scenario, not for function optimisation. It cannot be proven that the individuals of a final generation contain an optimal solution for the objective encoded in the fitness function but it can be shown mathematically that the genetic algorithm optimises the effort of testing and producing new individuals if their representation permits development of building blocks (also called schemata). In that case, the genetic algorithm is driven by an implicit parallelism and generates significantly more successful progeny than random search. In a number of applications where the search space was too large for other heuristic methods or too complex for analytic treatment genetic algorithms produced favourable results [8].
The basic outline of a genetic algorithm is as follows:

The mathematical foundation of genetic algorithms is the schemata theorem of J. H. Holland. It makes a statement about the propagation of schemata (or building blocks) within all individuals of one generation. A schema is implicitly contained in an individual. Like individuals, schemata consist of bit strings (1, 0) and can be as long as the individual itself. In addition, schemata may contain „don’t care“ positions where it is not specified whether the bit is 1 or 0, i.e. schemata H
are made from the alphabet {1, 0, #}. In other words, a schema is a generalisation of (parts of) an individual. For example, the individuals:
01010010100101010101110101010101 and
01011010100101110001110111010111
can be summarised by the schema:
0101#010100101#10#011101#10101#1
where all identical positions are retained and differing positions marked with a „#“ which stands for „don’t care“. The length
of the above schema is 31 which is one minus the distance from the first to the last fixed symbol (1 or 0 but not #). The total number of different schemata of length l over an alphabet of cardinality k is
. Because each string of length l contains
schemata, with n individuals in one generation there are between
and n
schemata in the population (depending on the similarity of the n individuals). The order of a schema o(H) is the number of fixed positions (1 or 0 but not #).
Let [9] s(H, t) be the number of occurrences of a particular schema H in a population of n individuals at time t. The bit string
of individual j gets then selected for reproduction with probability
:

where
is the fitness value of the j-th individual. The expected number of occurrences of schema H at time t+1 is:

with
(H) as the average fitness of all individuals (strings
) that contain H. Crossover and mutation operators can destroy schemata during reproduction. The longer a single individual, the smaller the probability that a schema H will be involved in a crossover event. The longer a schema, i.e. the larger
, the more likely is its destruction through recombination with another individual. Hence, for crossover the lower bound for the survival probability of a schema H is:

with L as the length of one whole individual. If we perform crossover stochastically at a frequency p
the survival probability p
becomes:

Summarising the effects of independent crossover and reproduction we arrive at the following equation for the expected occurrence of a schema H at time t+1:

This equation tells us that schemata increase over time proportional to their relative fitness and inverse proportional to their length. – Mutation can effect a schema H at each of its o(H) fixed positions with mutation probability p
. Survival of a single constant position in a schema is then p
= 1 - p
and survival of the entire schema:
which for small p
can be approximated by
. Summarising the effects of independent mutation, crossover and variation we get the following formula for the expected count of a schema H:

Assuming a schema H could always outperform other schemata by a fraction b of the total mean fitness then this equation can be rewritten as:

This equation is of the form
which says that the number of schemata better than average will exponentially increase over time. Effectively, many different schemata are sampled implicitly in parallel and good schemata will persist and grow. This is the basic rationale behind the genetic algorithm. It is suggested that if the (linear) representation of a problem allows the formation of schemata then the genetic algorithm can efficiently produce individuals that continuously improve in terms of the fitness function.
Let us examine the performance of the genetic algorithm on a simple application which is the search for the largest square product of an integer smaller than 32. Table 1 shows four initial individuals that were randomly generated. The bit strings of the individuals are decoded to unsigned integer values. The fitness function f (i) = i
is used to assign a fitness value to each individual. Depending on their relative fitness values the reproduction probability (between 0% and 100%) for each individual is calculated and converted into the number of expected successors. Then the so-called roulette wheel algorithm is used to perform a stochastic selection based on the reproduction probability. Three particular schemata and their occurence and distribution over the four individuals is monitored.
Table 2 shows the situation after reproduction. The individuals selected for reproduction have been replicated according to their relative fitness. Crossover sites and mating partners have been assigned randomly. To keep this example simple mutation is not used here. After performing crossover the new fitness values of the individuals in the new population are calculated. The performance of the three schemata H
, H
, and H
is also shown. Schema H
is of low fitness because it implies that the decoded integer is smaller than 8. Therefore, this schema gets only a small chance for reproduction. Actually, H
dies out as its only parent (the original individual #3) does not get selected for reproduction. Schemata H
and H
both have a reasonable chance for reproduction and are subsequently found in the new generation. Both average and best fitness values have significantly improved in the new generation.
In this example, we monitored only three schemata. There are, however, between
= 32 and 4 × 32 = 128 schemata in this small population that were all implicitly evaluated in the same manner in parallel only at the small computational cost of copying and exchanging a few bit strings. The implicit arithmetics of finding and promoting the best schemata do not actually have to be carried out by the computer. They are, so to speak, side effects of the genetic paradigm. This implicit parallelism is the basic reason for the efficiency of genetic algorithms.
As an addendum, the stochastic universal sampling algorithm for minimisation (as opposed to maximization) of fitness values by J. E. Baker is implemented below in the C programming language. This implementation is especially elegant because the source code is quite short and the generation of only one random number between 0 and 1 is needed to perform a random selection among all individuals in one generation according to their individual fitness values.
k = 0; /* k is an integer index of next individual to be selected */
ptr = Rand(); /* spin the roulette wheel; 0 < ptr < 1 */
Scaling_Factor = 1.0 / (Prev_Worst_Fitness - Average_Current_Fitness);
for (sum = i = 0; i < Popsize; i++)
/* Popsize is size of population */
{ /* Fitness[i] is the fitness value of individual i */
if (Fitness[i] < Prev_Worst_Fitness)
Expected_Count = (Prev_Worst_Fitness - Fitness[i]) * Scaling_Factor;
else Expected_Count = 0.0;
for (Sum += Expected_Count; Sum > ptr; ptr++)
sample[k++] = i;
/* sample is an array of „Popsize“ integers. */
} /* The value of each array element defines an individual. */
These instructions fill the array SAMPLE with integer values in a way that each individual with a fitness better than the average fitness of the current generation and better than the worst fitness in the last generation gets a chance of replication proportional to its fitness. Note the concerted increments of SUM and PTR in the inner FOR-loop.
I. Rechenberg in 1973 developed the so-called evolution strategy [10]. The central idea of a computational evolution strategy is to create from one (or, in general,
) „parent(s)“ one (or
) „offspring“ individuals that have been modified by a mutation operator.
and
are usually small integers. As with the genetic algorithm, parents and offspring represent potential solutions to an optimisation problem encoded as a set of numerical parameters. Either only the
offspring or both
parents plus
offspring individuals compete for survival into the next generation.
The mutation rate can be adjusted between successive generations to optimise the search progress. The so-called „1/5 rule“ predicts an optimal run time performance when about 20% of all mutations within one population produce „viable“ offspring. If more mutations produce successful offspring, the mutation rate is probably too small and larger mutations would be useful to cover more of the search space. If much less than 20% of all mutations lead to viable successors then it may be that the mutation rate is too large. A large mutation rate may have the effect that the individuals are scattered widely in search space and no convergence is found. In order to keep the mutation rate in an optimal region it was suggested to multiply (or divide, respectively) the current mutation rate by 1.25 between successive generations if the count of successful mutations falls below (exceeds) 20%. Mutation is a domain specific operator and can be implemented, for example, as a Gaussian distributed increment of a numerical parameter.
There are five subtle differences between a genetic algorithm and the evolution strategy.
In the recent past, both genetic algorithms and evolution strategy have gained wide public attention and many applications try to combine the best of both worlds [12].
This section describes the application of a genetic algorithm to the problem of protein structure prediction [13],[14],[15], with a simple force field as the fitness function. It is a continuation of work presented earlier [16]. Similar research on genetic algorithms and protein folding was done independently by several groups world wide [17]. Genetic algorithms have been used to predict optimal sequences to fit structural constraints [18], to fold Crambin in the Amber force field [19] and Mellitin in an empirical, statistical potential [20], and to predict main chain folding patterns of small proteins based on secondary structure predictions [21].
In this section the individuals of the genetic algorithm are conformations of a protein and the fitness function is a simple force field. In the following, the representation formalism, the fitness function and the genetic operators are described. Then, the results of an ab initio prediction run and of an experiment for side chain placement for the protein Crambin will be discussed.
1.2.1.1 Representation Formalism
For every application of a genetic algorithm one has to decide on a representation formalism for the „genes“. In this application, the so-called hybrid approach is taken [8]. This means that the genetic algorithm is configured to operate on numbers, not bit strings as in the original genetic algorithm. A hybrid representation is usually easier to implement and also facilitates the use of domain specific operators. However, three potential disadvantages are encountered:
It is not the principal goal of this application to find the single optimal conformation of a protein based on a force field but to generate a small set of native-like conformations. For this task the genetic algorithm is an appropriate tool. For a hybrid representation of proteins one can use Cartesian coordinates, torsion angles, rotamers or a simplified model of residues.
For a representation in Cartesian coordinates the 3-dimensional coordinates of all atoms in a protein are recorded. This representation has the advantage of being easily converted to and from the 3-dimensional conformation of a protein. However, it has the disadvantage that a mutation operator would in most instances create invalid protein conformations where some atoms lie too far apart or collide. Therefore a filter is needed which eliminates invalid individuals. Because such a filter would consume a disproportionate large amount of CPU time a Cartesian coordinate representation considerably slows down the search process of a genetic algorithm.
Another representation model is by torsion angles. Here, a protein is described by a set of torsion angles under the assumption of constant standard binding geometries. Bond lengths and bond angles are taken to be constant and cannot be changed by the genetic algorithm. This assumption is certainly a simplification of the real situation where bond length and bond angle to some extent depend on the environment of an atom. However, torsion angles provide enough degrees of freedom to represent any native conformation with only small r.m.s. [22] deviations.
Special to the torsion angle representation is the fact that even small changes in the
(phi) /
(psi) angles can induce large changes in the overall conformation. This is of advantage when creating variability within a population at the beginning of a run. Figure 1 explains the definition of the torsion angles
,
,
(omega),
1 (chi1) and
2 (chi2). A small fragment taken from a hypothetical protein is shown. Two basic building blocks, the amino acids phenylalanine (Phe) and glycine (Gly), are drawn as wire frame models. Atoms are labelled with their chemical symbols. Bonds in bold print indicate the backbone. The labels of torsion angles are placed next to their rotatable bonds.

In the present work the torsion angle representation is used. Torsion angles of 129 proteins from the Brookhaven database [23] (PDB) were statistically analysed for the definition of the MUTATE operator. The frequency of each torsion angle in intervals of 10° was determined and the ten most frequently occurring intervals are made available for substitution of individual torsion angles by the MUTATE operator. At the beginning of the run, individuals were initialised with either a completely extended conformation where all torsion angles are 180° or by a random selection from the ten most frequently occurring intervals of each torsion angle. For the
torsion angle the constant value of 180° was used because of the rigidity of the peptide bond between the atoms C
and N
. A statistical analysis of
angles shows that with the exception of proline average deviations from the mean of 180° occur rather frequently up to 5°, and only in rare cases up to 15°.
The genetic operators in this application operate on the torsion angle representation but the fitness function requires a protein conformation to be expressed in Cartesian coordinates. For the implementation of a conversion program bond angles were taken from the molecular modelling software Alchemy [24] and bond lengths from the program Charmm [25]. Either a complete form with explicit hydrogen atoms or the so-called extended atom representation with small groups of atoms represented as „super-atoms“ can be calculated. One conformation of a protein is encoded as an array of structures of the C programming language. The number of structures equals the number of residues in the protein. Each structure includes a three letter identifier of the residue type and ten floating point numbers for the torsion angles
,
,
,
1,
2,
3,
4,
5,
6, and
7. For residues with less than seven side chain torsion angles the extra fields are filled with a default value. The main chain torsion angle
was kept constant at 180°.
In this application a simple steric potential energy function was chosen as the fitness function (i.e. the objective function to be minimised). It is very difficult to find the global optimum of a potential energy function because of the large number of degrees of freedom even for a protein of average size. In general, molecules with n atoms have 3n - 6 degrees of freedom. For the case of a medium-sized protein of 100 residues this amounts to:
The steric potential energy function was adapted from the program Charmm. The total energy of a protein in solution is the sum of the expressions for
(bond length potential),
(bond angle potential),
(torsion angle potential),
(improper torsion angle potential),
(van der Waals pair interactions),
(electrostatic potential),
(hydrogen bonds), and of two expressions for interaction with the solvent,
and
:
The pseudo entropic term
for a conformation is a function of its actual diameter. The diameter is defined to be as the largest distance between any C_a atoms in one conformation. An exponential of the difference between actual and expected diameter is added to the potential energy if that difference is less than 15Ĺ. If the difference is greater than 15Ĺ a fixed amount of energy is added (10
kcal/mol) to avoid exponential overflow. If the actual diameter of an individual is smaller than the expected diameter,
is set to zero. The net result is that extended conformations have larger energy values and are therefore less fit for reproduction than globular conformations.
Occasionally, if two atoms are very close the
term can become very large. The maximum value for
in this case is 10
kcal/mol and the expressions for
and
are not calculated. Runs were performed with the potential energy function
as described above where lower fitness values mean fitter individuals and with a variant, where the four expressions
,
,
and
were given individual weights. The results were similar in all cases. Especially, scaling down the dominant effect of electrostatic interactions did not change the results (see below).
In order to combine individuals of one generation to produce new offspring nature as well as genetic algorithms apply several genetic operators. In the present work, individuals are protein conformations represented by a set of torsion angles under the assumption of constant standard binding geometries. Three operators are invented to modify these individuals: MUTATE, VARIATE and CROSSOVER. The decision about the application of an operator is made during run time and can be controlled by various parameters.
MUTATE. The first operator is the MUTATE operator. If MUTATE gets activated for a particular torsion angle, this angle will be replaced by a random choice of one of the ten most frequently occurring values for that type of residue. The decision, whether a torsion angle will be modified by MUTATE is made independently for each torsion angle in a protein. A random number between 0 and 1 is generated and if this number is greater than the MUTATE parameter at that time, MUTATE is applied. The MUTATE parameter can change dynamically during a run. The values that MUTATE can choose from come from a statistical analysis of 129 proteins from PDB. The number of instances in each of the 36 10°-intervals was counted for each torsion angle. The ten most frequent intervals, each represented by its left boundary, are available for substitution.
VARIATE. The VARIATE operator consists of three components: the 1°, 5° and 10° operator. Independently and after application of the MUTATE operator for each torsion angle in a protein two decisions are made: first, whether the VARIATE operator will be applied and, second, if so which of the three components shall be selected. The VARIATE operator increments or decrements (always an independent random chance of 1:1) the torsion angle by 1°, 5° or 10°. Care is taken that the range of torsion angles does not exceed the [-180°, 180°] interval. The probability of applying this operator is controlled by the VARIATE parameter which can change dynamically during run time. Similarly, three additional parameters control the probability for choosing among the three components. Alternatively, instead of three discrete increments a Gaussian uniformly distributed increment between 10° and +10° can be used.
CROSSOVER. The CROSSOVER operator has two components: the two point crossover and the uniform crossover. CROSSOVER is applied to two individuals independently of the MUTATE and VARIATE operators. First, individuals of the parent generation, possibly modified by MUTATE and VARIATE, are randomly grouped pairwise. For each pair, an independent decision is made whether or not to apply the CROSSOVER operator. The probability of this is controlled by a CROSSOVER parameter which can change dynamically during run time. If the decision is „no“, the two individuals are not further modified and added to the list of offspring. If the decision is „yes“, a choice between the two point crossover and the uniform crossover must be made. This decision is controlled by two other parameters that can also be changed during run time. The two point crossover selects randomly two residues on one of the individuals. Then the fragment between the two residues is exchanged with the corresponding fragment of the second individual. Alternatively, uniform crossover decides independently for each residue whether or not to exchange the torsion angles of that residue. The probability for an exchange is then always 50%.
Parameterization. As mentioned in the previous paragraphs, there are a number of parameters that control the run time behaviour of a genetic algorithm. The parameter values used for the experiments that will be presented in the „Results“ section below are summarised in Table 1. The main chain torsion angle
was kept constant at 180°. The initial generation was created by a random selection of torsion angles from a list of the ten most frequently occurring values for each angle. Ten individuals are in one generation. The genetic algorithm was halted after 1000 generations. At the start of the run, the probability for a torsion angle to be modified by the MUTATE operator is 80%, at the end of the run it becomes 20%. In between the probability decreases linearly with the number of generations. In contrast, the probability of applying the VARIATE operator increases from 20% at the beginning to 70% at the end of the run. The 10° component of the VARIATE operator is dominant at the start of the run (60%), whereas it is the 1° component at the end (80%). Likewise, the chance of performing a CROSSOVER rises from 10% to 70%. At the beginning of the run mainly uniform CROSSOVER is applied (90%), at the end it is mainly two point CROSSOVER (90%). This parameter setting uses a small number of individuals but runs over a large number of generations. This keeps computation time low while allowing a maximum number of crossover events. At the beginning of the run MUTATE and uniform CROSSOVER are applied most of the time to create some variety in the population so that many different regions of the search space are covered. At the end of the run the 1° component of the VARIATE operator dominates the scene. This is intended for fine tuning those conformations that have survived the selection pressure of evolution so far.
Generation Replacement. There are different ways how to select the individuals for the next generation. Given the constraint that the number of individuals should remain constant some individuals have to be discarded. Transition between generations can be done by total replacement, elitist replacement or steady state replacement. For total replacement only the newly created offspring enters the next generation and the parents of the previous generation are completely discarded. This has the disadvantage that a fit parent can be lost if it only once produces bad offspring. With elitist replacement all parents and offspring of one generation are sorted according to their fitness. If the size of the population is n, then the n fittest individuals are selected as parents for the following generation. This mode has been used here. Another variant is steady state replacement where two individuals are selected from the population based on their fitness and then modified by mutation and crossover. They are then used to replace their parents.
A prototype of a genetic algorithm with the representation, fitness function and operators as described above has been implemented. To evaluate the ab initio prediction performance of the genetic algorithm the sequence of Crambin was given to the program. Crambin is a plant seed protein from the cabbage Crambe abyssinica. Its structure was determined by W. A. Hendrickson and M. M. Teeter [30] to a resolution of 1.5Ĺ (Figures 2 and 3). Crambin has a strong amphiphilic character which makes its conformation especially difficult to predict. However, because of its good resolution and small size of 46 residues it was decided to use Crambin as a first candidate for prediction. The following structures are again displayed in stereo projection. If the observer manages to look cross eyed at the diagram in a way that superimposes both halves a 3-dimensional impression can be perceived.




Figure 4 shows two of the ten individuals in the last generation of the genetic algorithm. None of the ten individuals shows significant structural similarity to the native Crambin conformation. This can be confirmed by superimposing the generated structures with the native conformation. Table 2 shows the r.m.s. differences between all ten individuals and the native conformation. All values are in the range of 9Ĺ which rejects any significant structural homology.
The ten individuals of the last generation were measured against the native conformation of Crambin. R.m.s. values of around 9Ĺ for a small protein as Crambin exclude any significant structural similarity.
Although the genetic algorithm did not produce native-like conformations of Crambin, the generated backbone conformations could be those of a protein, i.e. they have no knots or unreasonably protruding extensions. The conformational results alone would indicate a complete failure of the genetic algorithm approach to conformational search but let us have a look at the energies in the final generation (Table 3). All individuals have a much lower energy than native Crambin in the same force field. That means that the genetic algorithm actually achieved a substantial optimisation but that the current fitness function was no good indicator for „nativeness“ of a conformation.
For each individual the van der Waals energy (
), electrostatic energy (
), torsion energy (
), pseudo entropic energy (
) and the sum of all terms (
) is shown. For comparison the values for native Crambin in the same force field are listed.
It is obvious that all individuals generated by the genetic algorithm have a much higher electrostatic potential than native Crambin. There are three reasons for this.
The final generation of only ten individuals contained two fundamentally different families of structures (class 1: P1, P2, P4, P5, P6, P8, P9) and (class 2: P3, P7, P10). Members of one class have a r.m.s. deviation of about 2 among each other but differ from members of the other class by about 9.
Taking into account the small population size, the significant increase in total energy of the individuals generated by the GA and the fact that the final generation contained two substantially different classes of conformations with very similar energies one is lead to the conclusion that the search performance of the genetic algorithm was not that bad at all. What remains a problem is to find a better fitness function that actually guides the genetic algorithm to native-like conformations. As the only criterion currently known to determine native conformation is the free energy the difficulty of this approach becomes obvious. One possible way to cope with the problem of inadequate fitness functions is to combine other heuristic criteria together with force field components in a multi-value vector fitness function. Before we turn to that approach let us first examine the performance of the current version for side chain placement.
Crystallographers often face the problem of positioning the side chains of a protein when the primary structure and the conformation of the backbone is known. At present, there is no method that automatically does side chain placement with sufficiently high accuracy for routine practical use. Although the side chain placement problem is conceptually easier than ab initio tertiary structure prediction it is still too complex for analytical treatment.
The genetic algorithm approach as described above can be used for side chain placement. The torsion angles
,
, and
have simply to be kept constant for a given backbone. Side chain placement by the genetic algorithm was done for Crambin. For each five residues, a superposition of the native and predicted conformation is shown in stereo projection graphs in Figure 5. As can be seen, the predictions are quite well in agreement with the native conformation in most cases. The overall r.m.s. difference in this example is 1.86Ĺ. This is not as good as but comparable to the results from a simulated annealing approach [31] (1.65Ĺ) and a heuristic approach (1.48 ) [32].









TTCCP SIVAR SNFNV CRLPG TPEAI CATYT GCIII PGATC PGDYA N
It must be emphasised that these runs were done without optimising either the force field parameters of the fitness function or the run time parameters of the genetic algorithm. From a more elaborate and fine-tuned experiment even better results should be expected.
In this section we will introduce additional fitness criteria for the protein folding application with genetic algorithms. The rationale behind is that a more of information about genuine protein conformations should improve the fitness function to guide the genetic algorithm towards native-like conformations. Some properties of protein conformations can be used as additional fitness components whereas others can be incorporated into genetic operators (e.g. constraints from the Ramachandran plot). For such an extended fitness function several incommensurable quantities will have to be combined: energy, preferred torsion angles, secondary structure propensities or distributions of polar and hydrophobic residues. This creates the problem of how to combine the different fitness contributions to arrive at the total fitness of a single individual. Simple summation of different components has the disadvantage that components with larger numbers would dominate the fitness function whether or not they are important or of any significance at all for a particular conformation. To cope with this difficulty individual weights for each of the components could be introduced. But this creates another problem. How should one determine useful values for these weights? As there is no general theory known for the proper weighting of each fitness component the only way is to try different combinations of values and evaluate them by their performance of a genetic algorithm on test proteins with known conformations. However, even for a small number of fitness components a large number of combinations of weights arises which requires as many test runs for evaluation. Also, „expensive“ fitness components as the van der Waals energy need considerable computation time. In this work [33] two measures were taken to deal with this situation:
In this application two versions of a fitness function are used. One version is a scalar fitness function that calculates the r.m.s.-deviation of a newly generated individual from the known conformation of the test protein. This geometric measure should guide the genetic algorithm directly to the desired solution but it is only available for proteins with a known conformation. R.m.s.-deviation is calculated as follows:

Here i is the index over all corresponding N atoms in the two structures to be compared, in this case the conformation of an individual (u
) in the current population and the known, actual structure (v
) of the test protein. The squares of the distances between the vectors u
and v
of corresponding atoms are summed and the square root is taken. The result is a measure of how much each atom in the individual deviates on average from its true position. R.m.s. values of 0 - 3 signify strong structural similarity; values of 4 - 6 Ĺ denote weak structural similarity whereas for small proteins r.m.s.-values over 6 Ĺ mean that probably not even the backbone folding pattern is similar in both conformations.
The other version of the fitness function is a vector of several fitness components which will be explained in the following paragraphs. The vector fitness function is in the following way used to determine the candidates for the next generation. If there is an individual that has better (i.e. lower) values in each fitness component, then we take it. If there is then another individual with the same property over the remaining set of individuals then take it as well, and so on until no unambiguously better individuals are found. Then remove the worst individuals, i.e. those with higher values in each fitness component than any other individual. The remaining set of individuals is heuristically reduced until the exact number of individuals for the next generation is reached. This is done by iteratively removing an individual with the worst fitness value in a randomly selected fitness component. This multi-value vector fitness function includes the following components:

R.m.s. is the r.m.s.-deviation as described above. It can only be calculated in test runs with the protein conformation known beforehand. For the multi-value vector fitness function this measure was calculated for each individual to see how close the genetic algorithm came to the known structure. In these runs, however, the r.m.s.-measure was not used in the offspring selection process. Selection was done only based on the remaining eight fitness components and a Pareto selection algorithm which will be explained shortly.
is the torsion energy of a conformation based on the force field data of the Charmm force field v.21 with k and n as force field constants depending on the type of atom and
as the torsion angle:
is the van der Waals energy (also called Lennard-Jones potential) with A and B as force field constants depending on the type of atom and r as the distance between two atoms in one molecule. The indices i and j for the two atoms may not have identical values and each pair is counted only once:

is the electrostatic energy between two atoms with
as the partial charges of the two atoms i and j and r as the distance between them:

is a measure to promote compact folding patterns. The expected diameter of a protein can be estimated by a number of techniques. A penalty energy term is then calculated as follows:
Polar is a measure that favours polar residues on the protein surface but not in the core. Because all fitness contributions should be minimised a factor of minus one is required before the sum. The larger the distances of polar residues to the centre of the protein, the better a conformation and the more negative the value of polar. If residue i is one of k polar residues (any of Arg, Lys, Asn, Asp, Glu, or Gln) in a protein of length N residues and with s as is the centre of gravity, then the polar fitness contribution of residue i at position u is calculated as follows:

Hydro is a similar measure that favours hydrophobic residues (Ala, Val, Ile, Leu, Phe, Pro, Trp) in the core of a protein, whereas scatter promotes compact folds as it adds up the distances over all C
atoms irrespective of amino acid type:

Solvent is the solvent accessible surface of a conformation in [2]. It is calculated by a surface triangulation method.
Crippen is an empirical, statistical potential developed by G. Crippen [34]. It is summed over all pairs of atoms that interact within a certain distance.
Clash is a term that counts the number of atomic collisions where any two atoms come closer than 3.8Ĺ to each other. This fitness term can be used to approximate the effect of the van der Waals energy at small distances but at only a fraction of the computational cost:

LOCAL TWIST Operator. The LOCAL TWIST operator introduces local conformation changes by performing the ring closure algorithm for polymers [35] of N. Go and H. A. Scheraga for three consecutive amino acid residues (Figure 6). This algorithm was originally implemented in the RING.FOR program [36] in a general way that operated on six adjacent dihedral angles to bridge a gap with bonds of defined length and bond angle. The application of this algorithm for a polypeptide required translation of the program into the C programming language and some alterations to the program to account for the intermitting rigid
torsion angle.

Stereoprojection of a portion of three residues and an alternative fold found by the LOCAL TWIST operator.
The basic concept of the ring closure algorithm is to find suitable values for ![]()
that satisfy the following equation:
Here, u
(transposed) and e
are vectors and T and R are several translation and rotation matrices that define the constraints of a local conformation change. The angle
describes the rigid geometry of a peptide bond and ![]()
is the first backbone torsion angle in sequence to be modified. The search for suitable values of ![]()
involves repeated numerical approximations and is therefore rather time consuming. Hence, it was decided to distribute the LOCAL TWIST operator over several processors on a parallel computer [37] so that the calculations can be carried out in parallel for all individuals. In test runs with r.m.s.-deviation to the native conformation as the fitness function the LOCAL TWIST operator led to significant improvements in prediction accuracy and also to a substantial decrease in overall computation time.
Preferred Backbone Conformations. The MUTATE operator of the previous chapter is rather crude because it always uses the left boundary of one of the ten most frequently occurring 10°-intervals for a torsion angle. To improve the chance of selecting favourable values for the backbone torsion angles
and
a cluster analysis with a modified nearest neighbour algorithm [38] was performed for the main chain torsion angles of 66 proteins:
This algorithm first identifies small clusters with only few examples in more detail and then clusters more densely populated areas with a finer resolution than a single clustering would do in one pass. Figure 7 shows the centres of 34 clusters for arginine.

Secondary Structure. In addition to a more accurate selection of preferable main chain torsion angles predictions of secondary structure were used to reduce the search space. Two issues arise which must be considered:
The first question was addressed by assembling a consensus prediction from two different methods: the PHD artificial neural network [39] and a statistical analysis which uses information theory [40],[41]. For the second question there are two alternate solutions. One alternative is to use torsion angles of idealised
-helices and
-strands, another is to constrain torsion angles of the predicted secondary structures to an interval that includes the conformation with idealised geometry. The corresponding torsion angles are shown in Table 4.
![]()
, ![]()
and ![]()
, ![]()
are lower and upper values of the main chain torsion angles in the respective secondary structure. ![]()
, and ![]()
are values for an idealised standard geometry. For
-strands the values are an average of parallel and antiparallel strands.
Using the genetic algorithm as described in the previous sections produced the following results. Figure 8 shows the best individual of the final generation of a run with a population of 30 individuals, the LOCAL TWIST operator in effect and r.m.s.-deviation as the only fitness component [42]. For Crambin, the final r.m.s.-deviation of the conformation generated by the genetic algorithm is 1.08 Ĺ, which is well within the range of the best resolution from X-ray or NMR structure elucidation experiments. Another run with the same parameters produced an individual with a r.m.s.-deviation of 0.89 Ĺ. This demonstrates the suitability of the genetic algorithm approach to protein folding. Given a reliable fitness function the genetic algorithm is able to successfully traverse the torsion angle search space.

Other proteins that were used for test purposes of the genetic algorithm with an r.m.s.-fitness function are the trypsin inhibitor protein (Brookhaven database code 5PTI; final r.m.s.-deviation 1.48 Ĺ; Figure 9) and RNAse T1 (Brookhaven database code 2RNT, final r.m.s.-deviation 2.32 Ĺ; Figure 10).


The fact that none of structures produced in the runs with a r.m.s.-fitness function were completely identical to the native conformations is explained by the following three observations:
Hence, with an increasing number of generations it becomes more and more difficult to achieve any further improvement in the r.m.s.-fitness and the search stagnates at r.m.s.-deviation values between 0 - 2 Ĺ (Figure 11).

Another conclusion to draw from the above experiments with the r.m.s.-fitness function is that the fitness function is the crucial topic. This is clearly an unresolved issue and subject of ongoing research in protein engineering. Some aspects of the computational complexity have already been explained above. This situation led to the following experiments with the genetic algorithm and a multi-value vector fitness function.
Figure 12 shows the results of a run with the fitness components polar,
,
,
, hydro, Crippen and solvent. This individual had an r.m.s.-deviation of 6.27 Ĺ to the native conformation of Crambin. The genetic algorithm did not use the r.m.s.-deviation as part of the fitness function. Only the fitness components listed above were used to guide the genetic algorithm. Over the whole run some of the fitness components decreased along with r.m.s.-deviation (
, hydro, Crippen, solvent), as was expected. However, the other fitness components (polar,
,
, ) actually drove the genetic algorithm to conformations with less similarity to the native Crambin indicating that these propensities were no good indicators for the „nativeness“ of Crambin. In general, no better r.m.s.-values than around 6 Ĺ were detected in similar runs.

The following conformations were generated with the fitness components Crippen, clash, hydro and scatter. In addition, constraints on the secondary structures of Crambin were imposed by limiting the backbone angles to intervals between the upper and lower values of Table 4. Torsion angle
was constrained to 180°. For a general application the use of secondary structure constraints requires a highly accurate and reliable secondary structure prediction algorithm which unfortunately does not (yet) exist. Figure 13 shows the backbone of an individual generated by the genetic algorithm with the above mentioned fitness components and that has a r.m.s.-deviation to native Crambin of 4.36 Ĺ

Another run with the same fitness components was performed for trypsin inhibitor (Figure 14). The r.m.s.-deviation to native trypsin inhibitor is 6.65 Ĺ. This is worse than the result for Crambin in Figure 13 because the lower content of secondary structure in trypsin inhibitor implies less rigid constraints on the conformation. This means there are more degrees of freedom and therefore a larger search space to traverse.

Summarising these findings and those of the previous subsections we are led to the following conclusions.