Genetic Algorithms and Protein Folding


Revised Tue Jun 25 17:30:20 MET DST 1996
by Dr. Steffen Schulze-Kremer
Westfälische Strasse 56, D-10711 Berlin, FRG
E-mail: steffen@chemie.fu-berlin.de
Table of Contents
1 Evolutionary Computation (introduction)
1.1 Methodology
1.1.1 Genetic Algorithms
1.1.2 Evolution Strategy
1.2 Applications
1.2.1 Protein Folding Simulation by Force Field Optimisation
1.2.1.1 Representation Formalism
1.2.1.2 Fitness Function
1.2.1.3 Conformational Energy
1.2.1.4 Genetic Operators
1.2.1.5 Ab initio Prediction Results
1.2.1.6 Side Chain Placement
1.2.2 Multi-Criteria Optimisation of Protein Conformations
1.2.2.1 Vector Fitness Function
1.2.2.2 Specialised Genetic Operators
1.2.2.3 Results
Exercises
References

Evolutionary Computation

Evolutionary Computation is, like neural networks, an example par excellence for an information processing paradigm that was originally developed and exhibited by nature and later discovered by man who subsequently transformed the general principle into computational algorithms to be put to work on computers. Nature makes in an impressive way use of the principle of genetic heritage and evolution. Application of the simple concept of performance based reproduction of individuals („survival of the fittest“) led to the rise of well adapted organisms that can endure in a potentially adverse environment. Mutually beneficial interdependencies, co-operation and even apparently altruistic behaviour can emerge solely by evolution. The investigation of those phenomena is part of research in artificial life but cannot be dealt with in this book.

Evolutionary computation comprises the four main areas of Genetic Algorithms [1], Evolution Strategies [2], Genetic Programming [3]and Simulated Annealing [4]. Genetic algorithms and evolution strategies emerged at about the same time in the United States of America and Germany. Both techniques model the natural evolution process in order to optimise either a fitness function (evolution strategies) or the effort of generating subsequent, well-adapted individuals in successive generations (genetic algorithms). Evolution strategies in their original form were basically stochastic hill-climbing algorithms and used for optimisation of complex, multi-parameter objective functions that in practise cannot be treated analytically. Genetic algorithms in their original form were not primarily designed for function optimisation but to demonstrate the efficiency of genetic crossover in assembling successful candidates over complicated search spaces. Genetic programming takes the idea of solving an optimisation problem by evolution of potential candidates one step further in that not only the parameters of a problem but also the structure of a solution is subject to evolutionary change. Simulated Annealing is mathematically similar to evolution strategies. It was originally derived from a physical model of crystallisation. Only two individuals compete for the highest rank according to a fitness function and the decision about accepting suboptimal candidates is controlled stochastically.

All methods presented in this chapter are heuristic, i.e. they contain a random component. As a consequence (and in contrast to deterministic methods) it can never be guaranteed that the algorithm will find an optimal solution or even any solution at all. Evolutionary algorithms are therefore used preferably for applications were deterministic or analytic methods fail, for example, because the underlying mathematical model is not well defined or the search space is too large for systematic, complete search (n.-p. completeness). Another application area for evolutionary algorithms that is rapidly growing is the simulation of living systems starting with single cells and proceeding to organisms, societies or even whole economic systems [5], [6]. The goal of artificial life is not primarily to model biological life as accurately as possible but to investigate how our life or other, presumably different forms of life could have emerged from non-living components.

Work with evolutionary algorithms bears the potential for a philosophically and epistemologically interesting recursion. At the beginning, evolution emerged spontaneously in nature. Next, man discovers the principle of evolution and acquires knowledge on its mathematical properties. He defines genetic algorithms for computers. To complete the recursive cycle, computational genetic algorithms are applied to the very objects (DNA, proteins) of which they had been derived in the beginning. A practical example of such a meta-recursive application will be given below in the sections on protein folding. Figure 1 illustrates this interplay of natural and simulated evolution.

Fig. 1. Interplay of Natural and Simulated Evolution

Respond to nature’s challenges by her own means. The principle of evolution can be used to model structures and to simulate biomolecular reactions.

1.1 Methodology

1.1.1 Genetic Algorithms

The so-called genetic algorithm [7] is a heuristic method that operates on pieces of information like nature does on genes in the course of evolution. Individuals are represented by a linear string of letters of an alphabet (in nature nucleotides, in genetic algorithms bits, characters, strings, numbers or other data structures) and they are allowed to mutate, crossover and reproduce. All individuals of one generation are evaluated by a fitness function. Depending on the generation replacement mode a subset of parents and offspring enters the next reproduction cycle. After a number of iterations the population consists of individuals that are well adapted in terms of the fitness function. Although this setting is reminiscent of a classical function optimisation problem genetic algorithms were originally designed to demonstrate the benefit of genetic crossover in an evolutionary scenario, not for function optimisation. It cannot be proven that the individuals of a final generation contain an optimal solution for the objective encoded in the fitness function but it can be shown mathematically that the genetic algorithm optimises the effort of testing and producing new individuals if their representation permits development of building blocks (also called schemata). In that case, the genetic algorithm is driven by an implicit parallelism and generates significantly more successful progeny than random search. In a number of applications where the search space was too large for other heuristic methods or too complex for analytic treatment genetic algorithms produced favourable results [8].

The basic outline of a genetic algorithm is as follows:

  1. Initialise a population of individuals. This can be done either randomly or with domain specific background knowledge to start the search with promising seed individuals. Where available the latter is always recommended.

  2. Evaluate all individuals of the initial population.

  3. Generate new individuals. The reproduction probability for an individual is proportional to its relative fitness within the current generation. Reproduction involves domain specific genetic operators (Figure 2). Operations to produce new individuals are:

    Fig. 2. Genetic Operators for the Genetic Algorithm.

    Mutation exchanges one single bit. Variation modifies the encoded value by a small increment (or decrement). Crossover (single-point) exchanges a contiguous fragment of an individual. Analogously, more than one crossover point can be selected and only the fragments between those positions exchanged (two-point crossover for two crossover points; uniform crossover for as many crossover sites as positions in the individual).

  4. Select individuals for the new parent generation.

  5. Go back to step 2 until either a desired fitness value was reached or until a predefined number of iterations was performed.

The mathematical foundation of genetic algorithms is the schemata theorem of J. H. Holland. It makes a statement about the propagation of schemata (or building blocks) within all individuals of one generation. A schema is implicitly contained in an individual. Like individuals, schemata consist of bit strings (1, 0) and can be as long as the individual itself. In addition, schemata may contain „don’t care“ positions where it is not specified whether the bit is 1 or 0, i.e. schemata H are made from the alphabet {1, 0, #}. In other words, a schema is a generalisation of (parts of) an individual. For example, the individuals:

01010010100101010101110101010101 and

01011010100101110001110111010111

can be summarised by the schema:

0101#010100101#10#011101#10101#1

where all identical positions are retained and differing positions marked with a „#“ which stands for „don’t care“. The length of the above schema is 31 which is one minus the distance from the first to the last fixed symbol (1 or 0 but not #). The total number of different schemata of length l over an alphabet of cardinality k is . Because each string of length l contains schemata, with n individuals in one generation there are between and n schemata in the population (depending on the similarity of the n individuals). The order of a schema o(H) is the number of fixed positions (1 or 0 but not #).

Let [9] s(H, t) be the number of occurrences of a particular schema H in a population of n individuals at time t. The bit string of individual j gets then selected for reproduction with probability :

where is the fitness value of the j-th individual. The expected number of occurrences of schema H at time t+1 is:

with (H) as the average fitness of all individuals (strings ) that contain H. Crossover and mutation operators can destroy schemata during reproduction. The longer a single individual, the smaller the probability that a schema H will be involved in a crossover event. The longer a schema, i.e. the larger , the more likely is its destruction through recombination with another individual. Hence, for crossover the lower bound for the survival probability of a schema H is:

with L as the length of one whole individual. If we perform crossover stochastically at a frequency p the survival probability p becomes:

Summarising the effects of independent crossover and reproduction we arrive at the following equation for the expected occurrence of a schema H at time t+1:

This equation tells us that schemata increase over time proportional to their relative fitness and inverse proportional to their length. – Mutation can effect a schema H at each of its o(H) fixed positions with mutation probability p. Survival of a single constant position in a schema is then p = 1 - p and survival of the entire schema:

which for small p can be approximated by . Summarising the effects of independent mutation, crossover and variation we get the following formula for the expected count of a schema H:

Assuming a schema H could always outperform other schemata by a fraction b of the total mean fitness then this equation can be rewritten as:

This equation is of the form which says that the number of schemata better than average will exponentially increase over time. Effectively, many different schemata are sampled implicitly in parallel and good schemata will persist and grow. This is the basic rationale behind the genetic algorithm. It is suggested that if the (linear) representation of a problem allows the formation of schemata then the genetic algorithm can efficiently produce individuals that continuously improve in terms of the fitness function.

Let us examine the performance of the genetic algorithm on a simple application which is the search for the largest square product of an integer smaller than 32. Table 1 shows four initial individuals that were randomly generated. The bit strings of the individuals are decoded to unsigned integer values. The fitness function f (i) = i is used to assign a fitness value to each individual. Depending on their relative fitness values the reproduction probability (between 0% and 100%) for each individual is calculated and converted into the number of expected successors. Then the so-called roulette wheel algorithm is used to perform a stochastic selection based on the reproduction probability. Three particular schemata and their occurence and distribution over the four individuals is monitored.

Table 1. Genetic Algorithm at Work (Part I)

Continued in Table 2. See main text for explanation.

Table 2. Genetic Algorithm at Work (Part II)

Continued from Table 1. See main text for explanation.

Table 2 shows the situation after reproduction. The individuals selected for reproduction have been replicated according to their relative fitness. Crossover sites and mating partners have been assigned randomly. To keep this example simple mutation is not used here. After performing crossover the new fitness values of the individuals in the new population are calculated. The performance of the three schemata H, H, and H is also shown. Schema H is of low fitness because it implies that the decoded integer is smaller than 8. Therefore, this schema gets only a small chance for reproduction. Actually, H dies out as its only parent (the original individual #3) does not get selected for reproduction. Schemata H and H both have a reasonable chance for reproduction and are subsequently found in the new generation. Both average and best fitness values have significantly improved in the new generation.

In this example, we monitored only three schemata. There are, however, between = 32 and 4 × 32 = 128 schemata in this small population that were all implicitly evaluated in the same manner in parallel only at the small computational cost of copying and exchanging a few bit strings. The implicit arithmetics of finding and promoting the best schemata do not actually have to be carried out by the computer. They are, so to speak, side effects of the genetic paradigm. This implicit parallelism is the basic reason for the efficiency of genetic algorithms.

As an addendum, the stochastic universal sampling algorithm for minimisation (as opposed to maximization) of fitness values by J. E. Baker is implemented below in the C programming language. This implementation is especially elegant because the source code is quite short and the generation of only one random number between 0 and 1 is needed to perform a random selection among all individuals in one generation according to their individual fitness values.


k = 0;    /* k is an integer index of next individual to be selected */
ptr = Rand();                /* spin the roulette wheel; 0 < ptr < 1 */
Scaling_Factor = 1.0 / (Prev_Worst_Fitness - Average_Current_Fitness);
for (sum = i = 0; i < Popsize; i++) 
                                    /* Popsize is size of population */
{                 /* Fitness[i] is the fitness value of individual i */
   if (Fitness[i] < Prev_Worst_Fitness)
    Expected_Count = (Prev_Worst_Fitness - Fitness[i]) * Scaling_Factor;
   else Expected_Count = 0.0;
   for (Sum += Expected_Count; Sum > ptr; ptr++)
            sample[k++] = i; 
                      /* sample is an array of „Popsize“ integers. */
}          /* The value of each array element defines an individual. */

These instructions fill the array SAMPLE with integer values in a way that each individual with a fitness better than the average fitness of the current generation and better than the worst fitness in the last generation gets a chance of replication proportional to its fitness. Note the concerted increments of SUM and PTR in the inner FOR-loop.

1.1.2 Evolution Strategy

I. Rechenberg in 1973 developed the so-called evolution strategy [10]. The central idea of a computational evolution strategy is to create from one (or, in general, ) „parent(s)“ one (or ) „offspring“ individuals that have been modified by a mutation operator. and are usually small integers. As with the genetic algorithm, parents and offspring represent potential solutions to an optimisation problem encoded as a set of numerical parameters. Either only the offspring or both parents plus offspring individuals compete for survival into the next generation.

The mutation rate can be adjusted between successive generations to optimise the search progress. The so-called „1/5 rule“ predicts an optimal run time performance when about 20% of all mutations within one population produce „viable“ offspring. If more mutations produce successful offspring, the mutation rate is probably too small and larger mutations would be useful to cover more of the search space. If much less than 20% of all mutations lead to viable successors then it may be that the mutation rate is too large. A large mutation rate may have the effect that the individuals are scattered widely in search space and no convergence is found. In order to keep the mutation rate in an optimal region it was suggested to multiply (or divide, respectively) the current mutation rate by 1.25 between successive generations if the count of successful mutations falls below (exceeds) 20%. Mutation is a domain specific operator and can be implemented, for example, as a Gaussian distributed increment of a numerical parameter.

There are five subtle differences between a genetic algorithm and the evolution strategy.

  1. The evolution strategy was designed as a function optimiser but not so genetic algorithms which were originally developed to demonstrate the benefits of crossover in simulated evolution.

  2. Reproduction in genetic algorithms is proportional to fitness but not in evolution strategy.

  3. The genetic algorithm makes a distinction between genotype (the bit strings which are manipulated by genetic operators) and phenotype (the decoded value, e.g. an integer value interpreted as a parameter of the objective function i.e. the fitness) of an individual whereas in the original evolution strategy both coincide.

  4. In evolution strategy, both parents and offspring may compete to survive into the next generation but not so in the original genetic algorithm. The extension to the early model of evolution strategy to include parents was suggested by H.-P. Schwefel [11].

  5. Mutation is the main force to drive an evolution strategy whereas it is crossover for the genetic algorithm. Meanwhile, both approaches have converged and hybrid systems use both mutation and crossover.

In the recent past, both genetic algorithms and evolution strategy have gained wide public attention and many applications try to combine the best of both worlds [12].

1.2.1 Protein Folding Simulation by Force Field Optimisation

This section describes the application of a genetic algorithm to the problem of protein structure prediction [13],[14],[15], with a simple force field as the fitness function. It is a continuation of work presented earlier [16]. Similar research on genetic algorithms and protein folding was done independently by several groups world wide [17]. Genetic algorithms have been used to predict optimal sequences to fit structural constraints [18], to fold Crambin in the Amber force field [19] and Mellitin in an empirical, statistical potential [20], and to predict main chain folding patterns of small proteins based on secondary structure predictions [21].

In this section the individuals of the genetic algorithm are conformations of a protein and the fitness function is a simple force field. In the following, the representation formalism, the fitness function and the genetic operators are described. Then, the results of an ab initio prediction run and of an experiment for side chain placement for the protein Crambin will be discussed.

1.2.1.1 Representation Formalism

For every application of a genetic algorithm one has to decide on a representation formalism for the „genes“. In this application, the so-called hybrid approach is taken [8]. This means that the genetic algorithm is configured to operate on numbers, not bit strings as in the original genetic algorithm. A hybrid representation is usually easier to implement and also facilitates the use of domain specific operators. However, three potential disadvantages are encountered:

  1. Strictly speaking, the mathematical foundation of genetic algorithms holds only for binary representations, although some of the mathematical properties are also valid for a floating point representation.

  2. Binary representations run faster in many applications.

  3. An additional encoding/decoding process may be required to map numbers onto bit strings.

It is not the principal goal of this application to find the single optimal conformation of a protein based on a force field but to generate a small set of native-like conformations. For this task the genetic algorithm is an appropriate tool. For a hybrid representation of proteins one can use Cartesian coordinates, torsion angles, rotamers or a simplified model of residues.

For a representation in Cartesian coordinates the 3-dimensional coordinates of all atoms in a protein are recorded. This representation has the advantage of being easily converted to and from the 3-dimensional conformation of a protein. However, it has the disadvantage that a mutation operator would in most instances create invalid protein conformations where some atoms lie too far apart or collide. Therefore a filter is needed which eliminates invalid individuals. Because such a filter would consume a disproportionate large amount of CPU time a Cartesian coordinate representation considerably slows down the search process of a genetic algorithm.

Another representation model is by torsion angles. Here, a protein is described by a set of torsion angles under the assumption of constant standard binding geometries. Bond lengths and bond angles are taken to be constant and cannot be changed by the genetic algorithm. This assumption is certainly a simplification of the real situation where bond length and bond angle to some extent depend on the environment of an atom. However, torsion angles provide enough degrees of freedom to represent any native conformation with only small r.m.s. [22] deviations.

Special to the torsion angle representation is the fact that even small changes in the (phi) / (psi) angles can induce large changes in the overall conformation. This is of advantage when creating variability within a population at the beginning of a run. Figure 1 explains the definition of the torsion angles , , (omega), 1 (chi1) and 2 (chi2). A small fragment taken from a hypothetical protein is shown. Two basic building blocks, the amino acids phenylalanine (Phe) and glycine (Gly), are drawn as wire frame models. Atoms are labelled with their chemical symbols. Bonds in bold print indicate the backbone. The labels of torsion angles are placed next to their rotatable bonds.

Figure 1. Torsion Angles , , , 1 and 2.

In the present work the torsion angle representation is used. Torsion angles of 129 proteins from the Brookhaven database [23] (PDB) were statistically analysed for the definition of the MUTATE operator. The frequency of each torsion angle in intervals of 10° was determined and the ten most frequently occurring intervals are made available for substitution of individual torsion angles by the MUTATE operator. At the beginning of the run, individuals were initialised with either a completely extended conformation where all torsion angles are 180° or by a random selection from the ten most frequently occurring intervals of each torsion angle. For the torsion angle the constant value of 180° was used because of the rigidity of the peptide bond between the atoms C and N. A statistical analysis of angles shows that with the exception of proline average deviations from the mean of 180° occur rather frequently up to 5°, and only in rare cases up to 15°.

The genetic operators in this application operate on the torsion angle representation but the fitness function requires a protein conformation to be expressed in Cartesian coordinates. For the implementation of a conversion program bond angles were taken from the molecular modelling software Alchemy [24] and bond lengths from the program Charmm [25]. Either a complete form with explicit hydrogen atoms or the so-called extended atom representation with small groups of atoms represented as „super-atoms“ can be calculated. One conformation of a protein is encoded as an array of structures of the C programming language. The number of structures equals the number of residues in the protein. Each structure includes a three letter identifier of the residue type and ten floating point numbers for the torsion angles , , , 1, 2, 3, 4, 5, 6, and 7. For residues with less than seven side chain torsion angles the extra fields are filled with a default value. The main chain torsion angle was kept constant at 180°.

1.2.1.2 Fitness Function

In this application a simple steric potential energy function was chosen as the fitness function (i.e. the objective function to be minimised). It is very difficult to find the global optimum of a potential energy function because of the large number of degrees of freedom even for a protein of average size. In general, molecules with n atoms have 3n - 6 degrees of freedom. For the case of a medium-sized protein of 100 residues this amounts to:

((100 residues × approximately 20 atoms per residue) × 3) - 6 = 5994
degrees of freedom. Systems of equations with this number of variables are analytically intractable today. Empirical efforts to heuristically find the optimum are almost as difficult [26]. If there are no constraints for the conformation of a protein and only its primary structure is given the number of conformations for a protein of medium size (100 residues) can be approximated to:
(5 torsion angles per residue × 5 likely values per torsion angle) = 25.
This means that in the worst case 25 conformations would have to be evaluated to find the global optimum. This is clearly beyond the capacity of today's and tomorrow's super computers. As can be seen from a number of previous applications genetic algorithms were able to find sub-optimal solutions to problems with an equally large search space [27],[28],[29]. Sub-optimal in this context means that it cannot be proven that the solutions generated by the genetic algorithm do in fact include an optimal solution but that some of the results generated by the genetic algorithm practically surpassed any previously known solution. This can be of much help in non-polynomial complete problems where no analytical solution of the problem is available.

1.2.1.3 Conformational Energy

The steric potential energy function was adapted from the program Charmm. The total energy of a protein in solution is the sum of the expressions for (bond length potential), (bond angle potential), (torsion angle potential), (improper torsion angle potential), (van der Waals pair interactions), (electrostatic potential), (hydrogen bonds), and of two expressions for interaction with the solvent, and :

= + + + + + + + + .
Here we assume constant bond lengths and bond angles. The expressions for , and are therefore constant for different conformations of the same protein. The expression was omitted because it would have required to exclude the effect of hydrogen bonds from the expressions for and . This, however, was not done by the authors of Charmm in their version v.21 of the program. In all runs, folding was simulated in vacuum with no ligands or solvent, i.e. and are constant. This is certainly a crude simplification of the real situation but nevertheless more detailed than the 2-D protein model in the previous section. Thus, the potential energy function simplifies to:
= + + .
Test runs showed that if only the three expressions , and are used there would be not enough force to drive the protein to a compact folded state. An exact solution to this problem requires the consideration of entropy. The calculation of the entropy difference between a folded and unfolded state is based on the interactions between protein and solvent. Unfortunately, it is not yet possible to routinely calculate an accurate model of those interactions. It was therefore decided to introduce an ad hoc pseudo entropic term that drives the protein to a globular state. The analysis of a number of globular proteins reveals the following empirical relation between the number of residues and the diameter:

The pseudo entropic term for a conformation is a function of its actual diameter. The diameter is defined to be as the largest distance between any C_a atoms in one conformation. An exponential of the difference between actual and expected diameter is added to the potential energy if that difference is less than 15Ĺ. If the difference is greater than 15Ĺ a fixed amount of energy is added (10 kcal/mol) to avoid exponential overflow. If the actual diameter of an individual is smaller than the expected diameter, is set to zero. The net result is that extended conformations have larger energy values and are therefore less fit for reproduction than globular conformations.

Occasionally, if two atoms are very close the term can become very large. The maximum value for in this case is 10 kcal/mol and the expressions for and are not calculated. Runs were performed with the potential energy function as described above where lower fitness values mean fitter individuals and with a variant, where the four expressions , , and were given individual weights. The results were similar in all cases. Especially, scaling down the dominant effect of electrostatic interactions did not change the results (see below).

1.2.1.4 Genetic Operators

In order to combine individuals of one generation to produce new offspring nature as well as genetic algorithms apply several genetic operators. In the present work, individuals are protein conformations represented by a set of torsion angles under the assumption of constant standard binding geometries. Three operators are invented to modify these individuals: MUTATE, VARIATE and CROSSOVER. The decision about the application of an operator is made during run time and can be controlled by various parameters.

MUTATE. The first operator is the MUTATE operator. If MUTATE gets activated for a particular torsion angle, this angle will be replaced by a random choice of one of the ten most frequently occurring values for that type of residue. The decision, whether a torsion angle will be modified by MUTATE is made independently for each torsion angle in a protein. A random number between 0 and 1 is generated and if this number is greater than the MUTATE parameter at that time, MUTATE is applied. The MUTATE parameter can change dynamically during a run. The values that MUTATE can choose from come from a statistical analysis of 129 proteins from PDB. The number of instances in each of the 36 10°-intervals was counted for each torsion angle. The ten most frequent intervals, each represented by its left boundary, are available for substitution.

VARIATE. The VARIATE operator consists of three components: the 1°, 5° and 10° operator. Independently and after application of the MUTATE operator for each torsion angle in a protein two decisions are made: first, whether the VARIATE operator will be applied and, second, if so which of the three components shall be selected. The VARIATE operator increments or decrements (always an independent random chance of 1:1) the torsion angle by 1°, 5° or 10°. Care is taken that the range of torsion angles does not exceed the [-180°, 180°] interval. The probability of applying this operator is controlled by the VARIATE parameter which can change dynamically during run time. Similarly, three additional parameters control the probability for choosing among the three components. Alternatively, instead of three discrete increments a Gaussian uniformly distributed increment between 10° and +10° can be used.

CROSSOVER. The CROSSOVER operator has two components: the two point crossover and the uniform crossover. CROSSOVER is applied to two individuals independently of the MUTATE and VARIATE operators. First, individuals of the parent generation, possibly modified by MUTATE and VARIATE, are randomly grouped pairwise. For each pair, an independent decision is made whether or not to apply the CROSSOVER operator. The probability of this is controlled by a CROSSOVER parameter which can change dynamically during run time. If the decision is „no“, the two individuals are not further modified and added to the list of offspring. If the decision is „yes“, a choice between the two point crossover and the uniform crossover must be made. This decision is controlled by two other parameters that can also be changed during run time. The two point crossover selects randomly two residues on one of the individuals. Then the fragment between the two residues is exchanged with the corresponding fragment of the second individual. Alternatively, uniform crossover decides independently for each residue whether or not to exchange the torsion angles of that residue. The probability for an exchange is then always 50%.

Parameterization. As mentioned in the previous paragraphs, there are a number of parameters that control the run time behaviour of a genetic algorithm. The parameter values used for the experiments that will be presented in the „Results“ section below are summarised in Table 1. The main chain torsion angle was kept constant at 180°. The initial generation was created by a random selection of torsion angles from a list of the ten most frequently occurring values for each angle. Ten individuals are in one generation. The genetic algorithm was halted after 1000 generations. At the start of the run, the probability for a torsion angle to be modified by the MUTATE operator is 80%, at the end of the run it becomes 20%. In between the probability decreases linearly with the number of generations. In contrast, the probability of applying the VARIATE operator increases from 20% at the beginning to 70% at the end of the run. The 10° component of the VARIATE operator is dominant at the start of the run (60%), whereas it is the 1° component at the end (80%). Likewise, the chance of performing a CROSSOVER rises from 10% to 70%. At the beginning of the run mainly uniform CROSSOVER is applied (90%), at the end it is mainly two point CROSSOVER (90%). This parameter setting uses a small number of individuals but runs over a large number of generations. This keeps computation time low while allowing a maximum number of crossover events. At the beginning of the run MUTATE and uniform CROSSOVER are applied most of the time to create some variety in the population so that many different regions of the search space are covered. At the end of the run the 1° component of the VARIATE operator dominates the scene. This is intended for fine tuning those conformations that have survived the selection pressure of evolution so far.

Table 1. Run Time Parameters.

Generation Replacement. There are different ways how to select the individuals for the next generation. Given the constraint that the number of individuals should remain constant some individuals have to be discarded. Transition between generations can be done by total replacement, elitist replacement or steady state replacement. For total replacement only the newly created offspring enters the next generation and the parents of the previous generation are completely discarded. This has the disadvantage that a fit parent can be lost if it only once produces bad offspring. With elitist replacement all parents and offspring of one generation are sorted according to their fitness. If the size of the population is n, then the n fittest individuals are selected as parents for the following generation. This mode has been used here. Another variant is steady state replacement where two individuals are selected from the population based on their fitness and then modified by mutation and crossover. They are then used to replace their parents.

1.2.1.5 Ab initio Prediction Results

A prototype of a genetic algorithm with the representation, fitness function and operators as described above has been implemented. To evaluate the ab initio prediction performance of the genetic algorithm the sequence of Crambin was given to the program. Crambin is a plant seed protein from the cabbage Crambe abyssinica. Its structure was determined by W. A. Hendrickson and M. M. Teeter [30] to a resolution of 1.5Ĺ (Figures 2 and 3). Crambin has a strong amphiphilic character which makes its conformation especially difficult to predict. However, because of its good resolution and small size of 46 residues it was decided to use Crambin as a first candidate for prediction. The following structures are again displayed in stereo projection. If the observer manages to look cross eyed at the diagram in a way that superimposes both halves a 3-dimensional impression can be perceived.

Figure 2. Stereoprojection of Crambin with Side Chains.

Figure 3. Stereoprojection of Crambin without Side Chains.


Figure 4. Two Conformations Generated by the Genetic Algorithm.

Figure 4 shows two of the ten individuals in the last generation of the genetic algorithm. None of the ten individuals shows significant structural similarity to the native Crambin conformation. This can be confirmed by superimposing the generated structures with the native conformation. Table 2 shows the r.m.s. differences between all ten individuals and the native conformation. All values are in the range of 9Ĺ which rejects any significant structural homology.

Table 2. R.m.s. Deviations to Native Crambin

The ten individuals of the last generation were measured against the native conformation of Crambin. R.m.s. values of around 9Ĺ for a small protein as Crambin exclude any significant structural similarity.

Although the genetic algorithm did not produce native-like conformations of Crambin, the generated backbone conformations could be those of a protein, i.e. they have no knots or unreasonably protruding extensions. The conformational results alone would indicate a complete failure of the genetic algorithm approach to conformational search but let us have a look at the energies in the final generation (Table 3). All individuals have a much lower energy than native Crambin in the same force field. That means that the genetic algorithm actually achieved a substantial optimisation but that the current fitness function was no good indicator for „nativeness“ of a conformation.

Table 3. Steric Energies in the Last Generation

For each individual the van der Waals energy (), electrostatic energy (), torsion energy (), pseudo entropic energy () and the sum of all terms () is shown. For comparison the values for native Crambin in the same force field are listed.

It is obvious that all individuals generated by the genetic algorithm have a much higher electrostatic potential than native Crambin. There are three reasons for this.

The final generation of only ten individuals contained two fundamentally different families of structures (class 1: P1, P2, P4, P5, P6, P8, P9) and (class 2: P3, P7, P10). Members of one class have a r.m.s. deviation of about 2 among each other but differ from members of the other class by about 9.

Taking into account the small population size, the significant increase in total energy of the individuals generated by the GA and the fact that the final generation contained two substantially different classes of conformations with very similar energies one is lead to the conclusion that the search performance of the genetic algorithm was not that bad at all. What remains a problem is to find a better fitness function that actually guides the genetic algorithm to native-like conformations. As the only criterion currently known to determine native conformation is the free energy the difficulty of this approach becomes obvious. One possible way to cope with the problem of inadequate fitness functions is to combine other heuristic criteria together with force field components in a multi-value vector fitness function. Before we turn to that approach let us first examine the performance of the current version for side chain placement.

1.2.1.6 Side Chain Placement

Crystallographers often face the problem of positioning the side chains of a protein when the primary structure and the conformation of the backbone is known. At present, there is no method that automatically does side chain placement with sufficiently high accuracy for routine practical use. Although the side chain placement problem is conceptually easier than ab initio tertiary structure prediction it is still too complex for analytical treatment.

The genetic algorithm approach as described above can be used for side chain placement. The torsion angles , , and have simply to be kept constant for a given backbone. Side chain placement by the genetic algorithm was done for Crambin. For each five residues, a superposition of the native and predicted conformation is shown in stereo projection graphs in Figure 5. As can be seen, the predictions are quite well in agreement with the native conformation in most cases. The overall r.m.s. difference in this example is 1.86Ĺ. This is not as good as but comparable to the results from a simulated annealing approach [31] (1.65Ĺ) and a heuristic approach (1.48 ) [32].


Residues 1 - 5


Residues 6 - 10


Residues 11 - 15


Residues 16 - 20


Residues 21 - 25


Residues 26 - 30


Residues 31 - 35


Residues 36 - 40


Residues 41 - 46

Figure 5. Side Chain Placement Results

A spatial superposition in stereoscopic wire frame diagrams is shown for every five residues of Crambin and the corresponding fragment generated by a genetic algorithm. The amino acid sequence of Crambin in one letter code is
TTCCP SIVAR SNFNV CRLPG TPEAI CATYT GCIII PGATC PGDYA N

It must be emphasised that these runs were done without optimising either the force field parameters of the fitness function or the run time parameters of the genetic algorithm. From a more elaborate and fine-tuned experiment even better results should be expected.

1.2.2 Multi-Criteria Optimisation of Protein Conformations

In this section we will introduce additional fitness criteria for the protein folding application with genetic algorithms. The rationale behind is that a more of information about genuine protein conformations should improve the fitness function to guide the genetic algorithm towards native-like conformations. Some properties of protein conformations can be used as additional fitness components whereas others can be incorporated into genetic operators (e.g. constraints from the Ramachandran plot). For such an extended fitness function several incommensurable quantities will have to be combined: energy, preferred torsion angles, secondary structure propensities or distributions of polar and hydrophobic residues. This creates the problem of how to combine the different fitness contributions to arrive at the total fitness of a single individual. Simple summation of different components has the disadvantage that components with larger numbers would dominate the fitness function whether or not they are important or of any significance at all for a particular conformation. To cope with this difficulty individual weights for each of the components could be introduced. But this creates another problem. How should one determine useful values for these weights? As there is no general theory known for the proper weighting of each fitness component the only way is to try different combinations of values and evaluate them by their performance of a genetic algorithm on test proteins with known conformations. However, even for a small number of fitness components a large number of combinations of weights arises which requires as many test runs for evaluation. Also, „expensive“ fitness components as the van der Waals energy need considerable computation time. In this work [33] two measures were taken to deal with this situation:


1.2.2.1 Vector Fitness Function

In this application two versions of a fitness function are used. One version is a scalar fitness function that calculates the r.m.s.-deviation of a newly generated individual from the known conformation of the test protein. This geometric measure should guide the genetic algorithm directly to the desired solution but it is only available for proteins with a known conformation. R.m.s.-deviation is calculated as follows:

Here i is the index over all corresponding N atoms in the two structures to be compared, in this case the conformation of an individual (u) in the current population and the known, actual structure (v) of the test protein. The squares of the distances between the vectors u and v of corresponding atoms are summed and the square root is taken. The result is a measure of how much each atom in the individual deviates on average from its true position. R.m.s. values of 0 - 3 signify strong structural similarity; values of 4 - 6 Ĺ denote weak structural similarity whereas for small proteins r.m.s.-values over 6 Ĺ mean that probably not even the backbone folding pattern is similar in both conformations.

The other version of the fitness function is a vector of several fitness components which will be explained in the following paragraphs. The vector fitness function is in the following way used to determine the candidates for the next generation. If there is an individual that has better (i.e. lower) values in each fitness component, then we take it. If there is then another individual with the same property over the remaining set of individuals then take it as well, and so on until no unambiguously better individuals are found. Then remove the worst individuals, i.e. those with higher values in each fitness component than any other individual. The remaining set of individuals is heuristically reduced until the exact number of individuals for the next generation is reached. This is done by iteratively removing an individual with the worst fitness value in a randomly selected fitness component. This multi-value vector fitness function includes the following components:

R.m.s. is the r.m.s.-deviation as described above. It can only be calculated in test runs with the protein conformation known beforehand. For the multi-value vector fitness function this measure was calculated for each individual to see how close the genetic algorithm came to the known structure. In these runs, however, the r.m.s.-measure was not used in the offspring selection process. Selection was done only based on the remaining eight fitness components and a Pareto selection algorithm which will be explained shortly.

is the torsion energy of a conformation based on the force field data of the Charmm force field v.21 with k and n as force field constants depending on the type of atom and as the torsion angle:

is the van der Waals energy (also called Lennard-Jones potential) with A and B as force field constants depending on the type of atom and r as the distance between two atoms in one molecule. The indices i and j for the two atoms may not have identical values and each pair is counted only once:

is the electrostatic energy between two atoms with as the partial charges of the two atoms i and j and r as the distance between them:

is a measure to promote compact folding patterns. The expected diameter of a protein can be estimated by a number of techniques. A penalty energy term is then calculated as follows:

Polar is a measure that favours polar residues on the protein surface but not in the core. Because all fitness contributions should be minimised a factor of minus one is required before the sum. The larger the distances of polar residues to the centre of the protein, the better a conformation and the more negative the value of polar. If residue i is one of k polar residues (any of Arg, Lys, Asn, Asp, Glu, or Gln) in a protein of length N residues and with s as is the centre of gravity, then the polar fitness contribution of residue i at position u is calculated as follows:

Hydro is a similar measure that favours hydrophobic residues (Ala, Val, Ile, Leu, Phe, Pro, Trp) in the core of a protein, whereas scatter promotes compact folds as it adds up the distances over all C atoms irrespective of amino acid type:

Solvent is the solvent accessible surface of a conformation in [2]. It is calculated by a surface triangulation method.

Crippen is an empirical, statistical potential developed by G. Crippen [34]. It is summed over all pairs of atoms that interact within a certain distance.

Clash is a term that counts the number of atomic collisions where any two atoms come closer than 3.8Ĺ to each other. This fitness term can be used to approximate the effect of the van der Waals energy at small distances but at only a fraction of the computational cost:

1.2.2.2 Specialised Genetic Operators

LOCAL TWIST Operator. The LOCAL TWIST operator introduces local conformation changes by performing the ring closure algorithm for polymers [35] of N. Go and H. A. Scheraga for three consecutive amino acid residues (Figure 6). This algorithm was originally implemented in the RING.FOR program [36] in a general way that operated on six adjacent dihedral angles to bridge a gap with bonds of defined length and bond angle. The application of this algorithm for a polypeptide required translation of the program into the C programming language and some alterations to the program to account for the intermitting rigid torsion angle.

Figure 6. Backbone Conformation changed by LOCAL TWIST Operator.

Stereoprojection of a portion of three residues and an alternative fold found by the LOCAL TWIST operator.

The basic concept of the ring closure algorithm is to find suitable values for that satisfy the following equation:

Here, u (transposed) and e are vectors and T and R are several translation and rotation matrices that define the constraints of a local conformation change. The angle describes the rigid geometry of a peptide bond and is the first backbone torsion angle in sequence to be modified. The search for suitable values of involves repeated numerical approximations and is therefore rather time consuming. Hence, it was decided to distribute the LOCAL TWIST operator over several processors on a parallel computer [37] so that the calculations can be carried out in parallel for all individuals. In test runs with r.m.s.-deviation to the native conformation as the fitness function the LOCAL TWIST operator led to significant improvements in prediction accuracy and also to a substantial decrease in overall computation time.

Preferred Backbone Conformations. The MUTATE operator of the previous chapter is rather crude because it always uses the left boundary of one of the ten most frequently occurring 10°-intervals for a torsion angle. To improve the chance of selecting favourable values for the backbone torsion angles and a cluster analysis with a modified nearest neighbour algorithm [38] was performed for the main chain torsion angles of 66 proteins:

  1. Cluster all / pairs for each amino acid until 21 clusters are formed.

  2. Collect all clusters with less than 10 pairs and add the centre of each cluster to the set of / pairs to be used by the MUTATE operator.

  3. Repeat the clustering procedure with only the / pairs from the clusters with at least 10 pairs in step 2 and let the clustering program run again until 21 clusters are formed. The centres of all new clusters complete the list of / pairs that MUTATE uses when substituting individual torsion angles.

This algorithm first identifies small clusters with only few examples in more detail and then clusters more densely populated areas with a finer resolution than a single clustering would do in one pass. Figure 7 shows the centres of 34 clusters for arginine.

Figure 7. 34 / Clusters for Arginine

There are 14 small clusters of the first pass with less than 10 pairs (shown as boxes) and 20 large clusters of the remaining pairs in the second pass (triangles).

Secondary Structure. In addition to a more accurate selection of preferable main chain torsion angles predictions of secondary structure were used to reduce the search space. Two issues arise which must be considered:

  1. Which secondary structure prediction algorithm should one rely on?

  2. Which torsion angles should be used for the predicted secondary structures?

The first question was addressed by assembling a consensus prediction from two different methods: the PHD artificial neural network [39] and a statistical analysis which uses information theory [40],[41]. For the second question there are two alternate solutions. One alternative is to use torsion angles of idealised -helices and -strands, another is to constrain torsion angles of the predicted secondary structures to an interval that includes the conformation with idealised geometry. The corresponding torsion angles are shown in Table 4.

Table 4. Boundaries for Main Chain Torsion Angles in Secondary Structures.

, and , are lower and upper values of the main chain torsion angles in the respective secondary structure. , and are values for an idealised standard geometry. For -strands the values are an average of parallel and antiparallel strands.

1.2.2.3 Results

Using the genetic algorithm as described in the previous sections produced the following results. Figure 8 shows the best individual of the final generation of a run with a population of 30 individuals, the LOCAL TWIST operator in effect and r.m.s.-deviation as the only fitness component [42]. For Crambin, the final r.m.s.-deviation of the conformation generated by the genetic algorithm is 1.08 Ĺ, which is well within the range of the best resolution from X-ray or NMR structure elucidation experiments. Another run with the same parameters produced an individual with a r.m.s.-deviation of 0.89 Ĺ. This demonstrates the suitability of the genetic algorithm approach to protein folding. Given a reliable fitness function the genetic algorithm is able to successfully traverse the torsion angle search space.

Figure 8. Crambin Predicted by R.M.S. Fitness Function

This conformation (solid line) with a r.m.s.-deviation of 1.08 Ĺ to native Crambin (dashed line) was obtained after 10,000 generations using the LOCAL TWIST, MUTATE, VARIATE and CROSSOVER operators and r.m.s.-deviation as the fitness function.

Other proteins that were used for test purposes of the genetic algorithm with an r.m.s.-fitness function are the trypsin inhibitor protein (Brookhaven database code 5PTI; final r.m.s.-deviation 1.48 Ĺ; Figure 9) and RNAse T1 (Brookhaven database code 2RNT, final r.m.s.-deviation 2.32 Ĺ; Figure 10).

Figure 9. Trypsin Inhibitor Predicted by R.M.S. Fitness Function

Stereoscopic superposition of the native conformation (dashed line) and one individual of the final generation (solid line). The r.m.s.-deviation is 1.48 Ĺ.

Figure 10. RNAse T1 Predicted by R.M.S. Fitness Function

Stereoscopic superposition of the native conformation (dashed line) and one individual of the final generation (solid line). The r.m.s.-deviation is 2.32.

The fact that none of structures produced in the runs with a r.m.s.-fitness function were completely identical to the native conformations is explained by the following three observations:

  1. The use of standard binding geometries for reconstructing 3D coordinates from a set of torsion angles can be the cause for structural alterations where the native conformation does not closely adhere to the theoretically derived ideal bond lengths and bond angles. In this case the best match will always have a r.m.s.-deviation of greater than zero.

  2. The operators MUTATE, VARIATE and CROSSOVER in theory cannot produce an exact match even if the target structure is known in detail. The reason for this comes from the representation formalism that these operators work on. If the current individual is already structurally similar to the desired protein then a single application of MUTATE or VARIATE is most likely to introduce mismatches of previously well fitting fragments and thus deteriorates the conformation. This happens because even if one bond becomes better aligned the rest of the protein towards the C-terminal swings away and increases the r.m.s.-deviation. CROSSOVER is not able to improve this situation by the same reason.

  3. Only the LOCAL TWIST operator can improve a fit locally without disturbing already well fitting fragments that surround the mutation site. However, the applicability of LOCAL TWIST is mathematically constrained: when starting from a less fitting conformation the optimal local improvement is not always be found in one pass. Sometimes it is even impossible to improve a local conformation at all.

Hence, with an increasing number of generations it becomes more and more difficult to achieve any further improvement in the r.m.s.-fitness and the search stagnates at r.m.s.-deviation values between 0 - 2 Ĺ (Figure 11).

Figure 11. Performance Comparison for the LOCAL TWIST Operator

This graph shows the course of six single experiments with the r.m.s-deviation as the fitness function. The individual with the best r.m.s.-deviation is plotted for each generation. The two thicker lines at the bottom have the LOCAL TWIST operator switched on after 3000 generations. Reproduction was done by the roulette wheel algorithm. The four runs without LOCAL TWIST had a population size of 54 individuals whereas the two runs with LOCAL TWIST had only 30.

Another conclusion to draw from the above experiments with the r.m.s.-fitness function is that the fitness function is the crucial topic. This is clearly an unresolved issue and subject of ongoing research in protein engineering. Some aspects of the computational complexity have already been explained above. This situation led to the following experiments with the genetic algorithm and a multi-value vector fitness function.

Figure 12 shows the results of a run with the fitness components polar, , , , hydro, Crippen and solvent. This individual had an r.m.s.-deviation of 6.27 Ĺ to the native conformation of Crambin. The genetic algorithm did not use the r.m.s.-deviation as part of the fitness function. Only the fitness components listed above were used to guide the genetic algorithm. Over the whole run some of the fitness components decreased along with r.m.s.-deviation (, hydro, Crippen, solvent), as was expected. However, the other fitness components (polar, , , ) actually drove the genetic algorithm to conformations with less similarity to the native Crambin indicating that these propensities were no good indicators for the „nativeness“ of Crambin. In general, no better r.m.s.-values than around 6 Ĺ were detected in similar runs.

Figure 12. Individual of the Final Generation of a Multi-Value Fitness Run

Only the fitness components polar, , , , hydro, Crippen and solvent were used to guide the genetic algorithm in this run. There is a vague similarity (r.m.s. 6.27 Ĺ) in the overall backbone fold of the generated individual (solid line) to native Crambin (dashed line).

The following conformations were generated with the fitness components Crippen, clash, hydro and scatter. In addition, constraints on the secondary structures of Crambin were imposed by limiting the backbone angles to intervals between the upper and lower values of Table 4. Torsion angle was constrained to 180°. For a general application the use of secondary structure constraints requires a highly accurate and reliable secondary structure prediction algorithm which unfortunately does not (yet) exist. Figure 13 shows the backbone of an individual generated by the genetic algorithm with the above mentioned fitness components and that has a r.m.s.-deviation to native Crambin of 4.36 Ĺ

Figure 13. Folding Crambin with Secondary Structure Constraints

The backbone of the predicted conformation (solid line) and Crambin (dashed line) have only a r.m.s.-deviation of 4.36 Ĺ. For this run only the fitness components Crippen, clash, hydro and scatter were used in the multi-value vector fitness function.

Another run with the same fitness components was performed for trypsin inhibitor (Figure 14). The r.m.s.-deviation to native trypsin inhibitor is 6.65 Ĺ. This is worse than the result for Crambin in Figure 13 because the lower content of secondary structure in trypsin inhibitor implies less rigid constraints on the conformation. This means there are more degrees of freedom and therefore a larger search space to traverse.

Figure 14. Backbone Folding of Trypsin Inhibitor

The backbone of the predicted conformation (solid line) and trypsin inhibitor (dashed line) have a r.m.s.-deviation of 6.65 Ĺ. For this run only the fitness components Crippen, clash, hydro and scatter were used. The comparatively bad performance of the genetic algorithm in comparison to the run on Crambin (Figure 13) is a result of the low content of secondary structure in trypsin inhibitor which increases the number of rotational degrees of freedom.

Summarising these findings and those of the previous subsections we are led to the following conclusions.

  1. Genetic algorithms proved to be an efficient search tool for 3-D representations of proteins. For a 3-D protein model with a simple, additive force field as fitness function and using a rather small population the genetic algorithm produced several individuals (i.e. protein conformations) of dissimilar topology but each with highly optimised fitness values.

  2. Given an appropriate fitness function (for test purposes the r.m.s-deviation to the a priori known conformation can be used) the genetic algorithm application described in this section finds the desired solution within only small deviations.

  3. The major problem lies in the fitness function. If there were one or a set of indicators that return „1“ for „the object is native protein conformation“ and „0“ for „the object is not a native protein conformation“ one could expect the genetic algorithm approach to deliver reasonably accurate ab initio predictions. However, neither mathematical models, empirical, semi-empirical or statistical force fields are yet accurate enough to reliably discriminate native from non-native conformations without additional constraints. Thus, the genetic algorithm produces (sub-)optimal conformations in a different sense than that of „nativeness“.

  4. Because secondary structure in nature and J. H. Holland’s building blocks in the genetic algorithm are analogous fundamental components for the construction of the individual, it was hoped to see secondary structures emerge as the building blocks in a subset of the population. This did not happen so far. One possible explanation is that the fitness functions used are not sensitive enough to detect and account for the structural benefits in secondary structures.
A collection of articles that use genetic algorithms for / in biocomputing.
References
  1. J. Holland, Genetic algorithms and the optimal allocations of trials, SIAM Journal of Computing, vol 2, no 2, pp. 88 - 105, 1973.
  2. I. Rechenberg, Bioinik, Evolution und Optimierung, Naturwissenschaftiche Rundschau, vol 26, pp. 465 - 472, 1973.
  3. J. Koza, Genetic Programming, MIT Press, 1993.
  4. S. Kirkpatrick, C. D. Gelatt, Jr., M. P. Vecchi, Optimization by Simulated Annealing, Science, vol 220, no 4598, pp. 671-680, 1983.
  5. T. Jones, S. Forrest, An Introduction to SFI Echo, November 1993, Santa Fe Institute, 1660 Old Pecos Trail, Suita A, Santa Fe NM 87501, email terry@sanatfe.edu, forrest@cs.unm.edu. World Wide Web Server at ftp://alife.santafe.edu/pub/SOFTWARE.
  6. J. H. Holland, Echoing Emergence: Objectives, Rough Definitions and Speculations for Echo-class Models, in Integrative Themes, (G. Cowan, D. Pines, D. Melzner, Eds.), Santa Fe Institute Studies in the Science of Complexity, Proc. Vol XIX, Reading, MA, Addison-Wesley, 1993.
  7. J. H. Holland, Adaptation in Natural and Artificial Systems, 2nd Ed., MIT Press, 1992.
  8. L. Davis, (ed.) Handbook of Genetic Algorithms, Van Nostrand Reinhold, New York, 1991.
  9. D. E. Goldberg, Genetic Algorithms in Search, Optimization & Machine Learning, Addison-Wesley, 1989.
  10. I. Rechenberg, Evolutionsstrategie, Frommann-Holzboog, Stuttgart, 1973, 2nd Edition 1994.
  11. H.-P. Schwefel, Numerical Optimization Of Computer Models, Chichester, John Wiley, 1981, (originally published in 1977).
  12. World Wide Web Servers with more information and software at
    http://lautaro.fb10.tu-berlin.de, http://www.cis.ohio-state.edu/hypertext/faq/usenet/ai-faq/genetic/top.html and
    http://www-mitpress.mit.edu/jrnls-catalog/evolution.html.
  13. G. E. Schulz, R. H. Schirmer, Principles of Protein Structure, Springer Verlag, 1979.
  14. A. M. Lesk, Protein Architecture - A Practical Approach, IRL Press, 1991.
  15. C. Branden, J. Tooze, Introduction to Protein Structure, Garland Publishing New York, 1991.
  16. S. Schulze-Kremer, Genetic Algorithms for Protein Tertiary Structure Prediction, in Parallel Problem Solving from Nature II, (R. Männer, B. Manderick, eds.), North Holland, pp. 391-400, 1992.
  17. For more information or to get in touch with researchers using genetic algorithms send an email to one of the following mailing lists: ga-molecule@interval.com, ga-list-request@aic.nrl.navy.mil or to Melanie Mitchell at mm@santafe.edu who keeps an extensive bibliography on applications of genetic algorithms in chemistry. Alternatively, try a search for „genetic algorithm“ in gopher space, for example at gopher://veronica.sunet.se.
  18. T. Dandekar, P. Argos, Potential of genetic algorithms in protein folding and protein engineering simulations, Protein Engineering, vol 5, no 7, pp. 637-645, 1992.
  19. S. M. Le Grand, K. M. Merz, The application of the genetic algorithm to the minimization of potential energy functions, The Journal of Global Optimization, vol 3, pp.49-66, 1993.
  20. S. Sun, Reduced representation model of protein structure prediction: statistical potential and genetic algorithms, Protein Science, vol 2, no 5, pp. 762-785, 1993.
  21. T. Dandekar, P. Argos, Folding the Main Chain of Small Proteins with the Genetic Algorithm, Journal of Molecular Biology, vol 236, pp. 844-861, 1994.
  22. r.m.s. = root mean square deviation; two conformations are superimposed and the square root is calculated from the sum of the squares of the distances between corresponding atoms.
  23. F. C. Bernstein, T. F. Koetzle, G. J. B. Williams, E. F. Meyer Jr., M. D. Brice, J. R. Rodgers, O. Kennard, T. Shimanouchi, M. Tasumi, The Protein Data Bank: A Computer-based Archival File for Macromolecular Structures, Journal of Molecular Biology, 112, pp. 535-542, 1977.
  24. J. G. Vinter, A. Davis, M. R. Saunders, Strategic approaches to drug design. An integrated software framework for molecular modelling, Journal of Computer-Aided Molecular Design, vol 1, pp. 31-51, 1987.
  25. B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, M. Karplus, Charmm: A program for Macromolecular Energy, Minimization and Dynamics Calculations, Journal of Computational Chemistry, vol 4, no 2, pp. 187-217, 1983.
  26. J. T. Ngo, J. Marks, Computational complexity of a problem in molecular-structure prediction, Protein Engineering, vol 5, no 4, pp. 313-321, 1992.
  27. L. Davis, (ed.) Handbook of Genetic Algorithms, van Nostrand Reinhold, New York, 1991.
  28. C. B. Lucasius, G. Kateman, Application of Genetic Algorithms to Chemometrics, Proceedings 3rd International Conference on Genetic Algorithms, (J. D. Schaffer , ed.), Morgan Kaufmann Publishers, San Mateo, CA, pp. 170-176, 1989.
  29. P. Tuffery, C. Etchebest, S. Hazout, R. Lavery, A new approach to the rapid determination of protein side chain conformations, J. Biomol. Struct. Dyn., vol 8, no 6, pp. 1267-1289, 1991.
  30. W. A. Hendrickson, M. M. Teeter, Structure of the Hydrophobic Protein Crambin Determined directly from the Anomalous Scattering of Sulphur, Nature, vol 290, pp. 107, 1981.
  31. C. Lee, S. Subbiah, Prediction of protein side chain conformation by packing optimization, Journal of Molecular Biology, no 217, pp. 373-388, 1991.
  32. P. Tuffery, C. Etchebest, S. Hazout, R. Lavery, A new approach to the rapid determination of protein side chain conformations, J. Biomol. Struct. Dyn., vol 8, no 6, pp. 1267-1289, 1991.
  33. S. Schulze-Kremer, A. Levin, Search for Protein Conformations with a Parallel Genetic Algorithm: Evaluation of a Vector Fitness Function, Final Report and Technical Documentation of Protein Folding Application (Work package 1), European ESPRIT project # 6857 PAPAGENA, August 1994.
  34. N. M. Maiorov, G. M. Crippen, Contact potential that recognizes the correct folding of globular proteins, Journal of Molecular Biology, vol 227, pp. 876 - 888, 1992.
  35. N. Go, H. A. Scheraga, Ring Closure and Local Conformational Deformations of Chain Molecules, Macromolecules, vol 3, pp. 178-187, 1970.
  36. Quantum Chemical Exchange Program (QCEP) program No. QCMP 046.
  37. Intel Paragon with 98 x i860 processors owned by Parallab, University of Bergen, Norway.
  38. S. Y. Lu, K. S. Fu, A sentence-to-sentence clustering procedure for pattern analysis, IEEE Transactions on Systems, Man and Cybernatics SMC, vol 8, pp. 381 - 389, 1978.
  39. B. Rost, C. Sander, Prediction of protein secondary structure at better than 70% accuracy, Journal of Molecular Biology, vol 232, pp. 584 - 599, 1993.
  40. R.S. Cármenes, J.P. Freije, M.M. Molina, J.M. Martín, PREDICT7, a program for protein structure prediction, Biochem. Biophys. Res. Comm., vol 159, pp. 687-693, 1989.
  41. J. Garnier, D. J. Osguthorpe, B. Robson, Journal of Molecular Biology, vol 120, pp. 97-120, 1978.
  42. This is only available for test runs with already known protein conformations.