This section describes the application of a genetic algorithm to the problem of protein structure prediction [13],[14],[15], with a simple force field as the fitness function. It is a continuation of work presented earlier [16]. Similar research on genetic algorithms and protein folding was done independently by several groups world wide [17]. Genetic algorithms have been used to predict optimal sequences to fit structural constraints [18], to fold Crambin in the Amber force field [19] and Mellitin in an empirical, statistical potential [20], and to predict main chain folding patterns of small proteins based on secondary structure predictions [21].
In this section the individuals of the genetic algorithm are conformations of a protein and the fitness function is a simple force field. In the following, the representation formalism, the fitness function and the genetic operators are described. Then, the results of an ab initio prediction run and of an experiment for side chain placement for the protein Crambin will be discussed.
1.2.1.1 Representation Formalism
For every application of a genetic algorithm one has to decide on a representation formalism for the „genes“. In this application, the so-called hybrid approach is taken [8]. This means that the genetic algorithm is configured to operate on numbers, not bit strings as in the original genetic algorithm. A hybrid representation is usually easier to implement and also facilitates the use of domain specific operators. However, three potential disadvantages are encountered:
It is not the principal goal of this application to find the single optimal conformation of a protein based on a force field but to generate a small set of native-like conformations. For this task the genetic algorithm is an appropriate tool. For a hybrid representation of proteins one can use Cartesian coordinates, torsion angles, rotamers or a simplified model of residues.
For a representation in Cartesian coordinates the 3-dimensional coordinates of all atoms in a protein are recorded. This representation has the advantage of being easily converted to and from the 3-dimensional conformation of a protein. However, it has the disadvantage that a mutation operator would in most instances create invalid protein conformations where some atoms lie too far apart or collide. Therefore a filter is needed which eliminates invalid individuals. Because such a filter would consume a disproportionate large amount of CPU time a Cartesian coordinate representation considerably slows down the search process of a genetic algorithm.
Another representation model is by torsion angles. Here, a protein is described by a set of torsion angles under the assumption of constant standard binding geometries. Bond lengths and bond angles are taken to be constant and cannot be changed by the genetic algorithm. This assumption is certainly a simplification of the real situation where bond length and bond angle to some extent depend on the environment of an atom. However, torsion angles provide enough degrees of freedom to represent any native conformation with only small r.m.s. [22] deviations.
Special to the torsion angle representation is the fact that even small changes in the
(phi) /
(psi) angles can induce large changes in the overall conformation. This is of advantage when creating variability within a population at the beginning of a run. Figure 1 explains the definition of the torsion angles
,
,
(omega),
1 (chi1) and
2 (chi2). A small fragment taken from a hypothetical protein is shown. Two basic building blocks, the amino acids phenylalanine (Phe) and glycine (Gly), are drawn as wire frame models. Atoms are labelled with their chemical symbols. Bonds in bold print indicate the backbone. The labels of torsion angles are placed next to their rotatable bonds.

In the present work the torsion angle representation is used. Torsion angles of 129 proteins from the Brookhaven database [23] (PDB) were statistically analysed for the definition of the MUTATE operator. The frequency of each torsion angle in intervals of 10° was determined and the ten most frequently occurring intervals are made available for substitution of individual torsion angles by the MUTATE operator. At the beginning of the run, individuals were initialised with either a completely extended conformation where all torsion angles are 180° or by a random selection from the ten most frequently occurring intervals of each torsion angle. For the
torsion angle the constant value of 180° was used because of the rigidity of the peptide bond between the atoms C
and N
. A statistical analysis of
angles shows that with the exception of proline average deviations from the mean of 180° occur rather frequently up to 5°, and only in rare cases up to 15°.
The genetic operators in this application operate on the torsion angle representation but the fitness function requires a protein conformation to be expressed in Cartesian coordinates. For the implementation of a conversion program bond angles were taken from the molecular modelling software Alchemy [24] and bond lengths from the program Charmm [25]. Either a complete form with explicit hydrogen atoms or the so-called extended atom representation with small groups of atoms represented as „super-atoms“ can be calculated. One conformation of a protein is encoded as an array of structures of the C programming language. The number of structures equals the number of residues in the protein. Each structure includes a three letter identifier of the residue type and ten floating point numbers for the torsion angles
,
,
,
1,
2,
3,
4,
5,
6, and
7. For residues with less than seven side chain torsion angles the extra fields are filled with a default value. The main chain torsion angle
was kept constant at 180°.