Genetic Algorithms and Protein Folding


by Dr. Steffen Schulze-Kremer
Westfälische Strasse 56, D-10711 Berlin, FRG
E-mail: steffen@chemie.fu-berlin.de

1.2.2.3 Results

The genetic algorithm described in the previous sections produced the following results. Figure 8 shows the best individual of the final generation of a run with a population of 30 individuals, the LOCAL TWIST operator in effect and r.m.s.-deviation as the only fitness component [42]. For Crambin, the final r.m.s.-deviation of the conformation generated by the genetic algorithm is 1.08 Å, which is well within the range of the best resolution from X-ray or NMR structure elucidation experiments. Another run with the same parameters produced an individual with an r.m.s.-deviation of 0.89 Å. This demonstrates the suitability of the genetic algorithm approach to protein folding: given a reliable fitness function, the genetic algorithm is able to traverse the torsion angle search space successfully.
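As a minimal sketch of what an r.m.s.-deviation fitness component computes, the following Python fragment compares two coordinate lists atom by atom. It assumes that both structures are already superimposed and contain the same atoms in the same order; the function names and the toy coordinates are hypothetical and are not taken from the program described in this chapter.

```python
import math

def rms_deviation(coords_a, coords_b):
    """Root-mean-square deviation between two lists of (x, y, z) points.

    Assumption: both structures are already superimposed, so no optimal
    rotation or translation is searched for here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

def rms_fitness(candidate, native):
    # Lower is better; 0.0 would mean an exact match with the native structure.
    return rms_deviation(candidate, native)

# Hypothetical toy coordinates (in Angstrom), for illustration only.
native = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
candidate = [(0.0, 0.0, 0.0), (1.5, 1.0, 0.0), (3.0, 0.0, 0.0)]
print(rms_fitness(candidate, native))
```

In a real comparison the candidate would first have to be superimposed on the native structure, a step that this sketch deliberately leaves out.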

Figure 8. Crambin Predicted by R.M.S. Fitness Function

This conformation (solid line) with an r.m.s.-deviation of 1.08 Å to native Crambin (dashed line) was obtained after 10,000 generations using the LOCAL TWIST, MUTATE, VARIATE and CROSSOVER operators and r.m.s.-deviation as the fitness function.

Other proteins used to test the genetic algorithm with an r.m.s.-fitness function are the trypsin inhibitor protein (Brookhaven database code 5PTI; final r.m.s.-deviation 1.48 Å; Figure 9) and RNAse T1 (Brookhaven database code 2RNT; final r.m.s.-deviation 2.32 Å; Figure 10).

Figure 9. Trypsin Inhibitor Predicted by R.M.S. Fitness Function

Stereoscopic superposition of the native conformation (dashed line) and one individual of the final generation (solid line). The r.m.s.-deviation is 1.48 Å.

Figure 10. RNAse T1 Predicted by R.M.S. Fitness Function

Stereoscopic superposition of the native conformation (dashed line) and one individual of the final generation (solid line). The r.m.s.-deviation is 2.32 Å.

The fact that none of the structures produced in the runs with an r.m.s.-fitness function was completely identical to the native conformation is explained by the following three observations:

  1. The use of standard binding geometries for reconstructing 3D coordinates from a set of torsion angles can cause structural deviations wherever the native conformation does not closely adhere to the theoretically derived ideal bond lengths and bond angles. In this case the best match will always have an r.m.s.-deviation greater than zero.

  2. The operators MUTATE, VARIATE and CROSSOVER cannot, in theory, produce an exact match even if the target structure is known in detail. The reason lies in the representation formalism that these operators work on. If the current individual is already structurally similar to the desired protein, a single application of MUTATE or VARIATE is most likely to introduce mismatches in previously well fitting fragments and thus to deteriorate the conformation. This happens because, even if one bond becomes better aligned, the rest of the protein towards the C-terminal swings away and increases the r.m.s.-deviation. CROSSOVER cannot improve this situation, for the same reason.

  3. Only the LOCAL TWIST operator can improve a fit locally without disturbing already well fitting fragments that surround the mutation site. However, the applicability of LOCAL TWIST is mathematically constrained: when starting from a less well fitting conformation, the optimal local improvement is not always found in one pass. Sometimes it is even impossible to improve a local conformation at all.
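The lever effect behind the second observation can be illustrated with a deliberately simplified model. The sketch below builds a planar chain from relative turning angles, which is an assumption made only for brevity; the chapter's representation uses torsion angles in three dimensions, and all names here are hypothetical. Changing a single angle leaves the points before the change untouched but displaces every point after it, and the displacement grows with distance from the changed bond.

```python
import math

def build_chain(turn_angles, bond_length=1.0):
    """Build a planar chain from relative turning angles given in radians."""
    points = [(0.0, 0.0)]
    heading = 0.0
    for angle in turn_angles:
        heading += angle
        x, y = points[-1]
        points.append((x + bond_length * math.cos(heading),
                       y + bond_length * math.sin(heading)))
    return points

def displacement(chain_a, chain_b):
    """Distance between corresponding points of two chains."""
    return [math.dist(p, q) for p, q in zip(chain_a, chain_b)]

# A straight chain of ten bonds stands in for a well fitting conformation.
target = build_chain([0.0] * 10)

# Change one angle near the start of the chain by 10 degrees.
mutated_angles = [0.0] * 10
mutated_angles[2] = math.radians(10.0)
mutated = build_chain(mutated_angles)

shifts = displacement(target, mutated)
for index, shift in enumerate(shifts):
    print(index, round(shift, 3))
```

Everything downstream of the changed angle rotates as a rigid body about the pivot, which is why a small local change can raise the overall r.m.s.-deviation considerably. An operator that keeps both ends of a fragment fixed, as LOCAL TWIST is described to do, avoids this effect.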

Hence, with an increasing number of generations it becomes more and more difficult to achieve any further improvement in the r.m.s.-fitness, and the search stagnates at r.m.s.-deviation values between 0 and 2 Å (Figure 11).

Figure 11. Performance Comparison for the LOCAL TWIST Operator

This graph shows the course of six single experiments with the r.m.s.-deviation as the fitness function. The individual with the best r.m.s.-deviation is plotted for each generation. The two thicker lines at the bottom have the LOCAL TWIST operator switched on after 3000 generations. Reproduction was done by the roulette wheel algorithm. The four runs without LOCAL TWIST had a population size of 54 individuals, whereas the two runs with LOCAL TWIST had only 30.
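For readers unfamiliar with roulette wheel reproduction, here is a minimal sketch of the general technique. Because a smaller r.m.s.-deviation is better, the deviations must first be turned into selection weights; the inverse weighting used below is an assumption chosen for illustration, not necessarily the scaling used in the experiments, and all names and values are hypothetical.

```python
import random

def weights_from_rms(rms_values, epsilon=1e-6):
    """Turn r.m.s.-deviations (lower is better) into positive weights.

    Assumption: simple inverse weighting; epsilon avoids division by zero.
    """
    return [1.0 / (value + epsilon) for value in rms_values]

def roulette_wheel_select(population, weights, rng, count):
    """Draw `count` individuals with probability proportional to their weight."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    chosen = []
    for _ in range(count):
        spin = rng.random() * total
        running = 0.0
        for individual, weight in zip(population, weights):
            running += weight
            if spin < running:
                chosen.append(individual)
                break
        else:
            # Guard against floating point rounding at the end of the wheel.
            chosen.append(population[-1])
    return chosen

# Hypothetical population of three labelled individuals.
population = ["A", "B", "C"]
rms_values = [1.0, 2.0, 8.0]
rng = random.Random(0)
parents = roulette_wheel_select(population, weights_from_rms(rms_values), rng, 1000)
print({name: parents.count(name) for name in population})
```

Individuals with a small r.m.s.-deviation occupy a larger share of the wheel and are therefore drawn more often, while poorer individuals still keep a small chance of reproducing.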

Another conclusion to draw from the above experiments with the r.m.s.-fitness function is that the fitness function is the crucial factor. Finding a reliable one is clearly an unresolved issue and the subject of ongoing research in protein engineering. Some aspects of the computational complexity have already been explained above. This situation led to the following experiments with the genetic algorithm and a multi-value vector fitness function.

Figure 12 shows the results of a run with the fitness components polar, , , , hydro, Crippen and solvent. This individual had an r.m.s.-deviation of 6.27 Å to the native conformation of Crambin. The genetic algorithm did not use the r.m.s.-deviation as part of the fitness function; only the fitness components listed above were used to guide it. Over the whole run some of the fitness components decreased along with the r.m.s.-deviation (, hydro, Crippen, solvent), as was expected. However, the other fitness components (polar, , , ) actually drove the genetic algorithm to conformations with less similarity to native Crambin, indicating that these propensities were not good indicators for the "nativeness" of Crambin. In general, no r.m.s.-values better than around 6 Å were detected in similar runs.
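One way to check whether a fitness component tracks similarity to the native structure is to correlate its values with the r.m.s.-deviation over a set of conformations. The sketch below uses the Pearson correlation coefficient for this purpose; this is an illustrative analysis, not a procedure described in the chapter, and the numbers are made-up toy values, not data from the runs.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / math.sqrt(var_x * var_y)

# Made-up toy values for five conformations (illustration only).
rms_values = [12.0, 10.0, 9.0, 7.5, 6.3]
helpful_component = [5.0, 4.2, 3.9, 3.1, 2.8]     # falls together with r.m.s.
misleading_component = [1.0, 1.4, 1.9, 2.5, 3.0]  # rises while r.m.s. falls

print(pearson(rms_values, helpful_component))
print(pearson(rms_values, misleading_component))
```

A component that is minimised by the genetic algorithm is only a useful guide if it correlates positively with the r.m.s.-deviation; a negative correlation means that minimising the component pushes conformations away from the native structure.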

Figure 12. Individual of the Final Generation of a Multi-Value Fitness Run

Only the fitness components polar, , , , hydro, Crippen and solvent were used to guide the genetic algorithm in this run. There is a vague similarity (r.m.s. 6.27 Å) in the overall backbone fold of the generated individual (solid line) to native Crambin (dashed line).

The following conformations were generated with the fitness components Crippen, clash, hydro and scatter. In addition, constraints on the secondary structures of Crambin were imposed by limiting the backbone angles to intervals between the upper and lower values of Table 4. Torsion angle was constrained to 180°. For a general application, the use of secondary structure constraints requires a highly accurate and reliable secondary structure prediction algorithm, which unfortunately does not (yet) exist. Figure 13 shows the backbone of an individual that was generated by the genetic algorithm with the above mentioned fitness components and that has an r.m.s.-deviation to native Crambin of 4.36 Å.
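Interval constraints of this kind can be sketched as a clamping step applied after every change to a residue's backbone angles. The interval values below are placeholders and are not the values of Table 4; the angle names phi and psi, the structure labels and the function names are likewise assumptions made for the example.

```python
import random

# Placeholder intervals in degrees; NOT the values of Table 4.
HYPOTHETICAL_INTERVALS = {
    "helix": {"phi": (-80.0, -40.0), "psi": (-60.0, -20.0)},
    "strand": {"phi": (-150.0, -90.0), "psi": (90.0, 150.0)},
}

def wrap_degrees(value):
    """Map any angle to the range [-180, 180)."""
    return (value + 180.0) % 360.0 - 180.0

def clamp(value, lower, upper):
    return max(lower, min(upper, value))

def constrain_residue(angles, structure_type):
    """Return a copy of `angles` with constrained entries forced into range.

    Residues without an assigned secondary structure are only wrapped.
    The placeholder intervals do not cross the +/-180 degree boundary.
    """
    limits = HYPOTHETICAL_INTERVALS.get(structure_type, {})
    result = {}
    for name, value in angles.items():
        value = wrap_degrees(value)
        if name in limits:
            value = clamp(value, *limits[name])
        result[name] = value
    return result

def mutate_within_constraints(angles, structure_type, rng, step=30.0):
    """Perturb each angle by up to +/- step degrees, then re-apply constraints."""
    perturbed = {name: value + rng.uniform(-step, step)
                 for name, value in angles.items()}
    return constrain_residue(perturbed, structure_type)

rng = random.Random(1)
residue = constrain_residue({"phi": 10.0, "psi": 100.0}, "helix")
print(residue)
print(mutate_within_constraints(residue, "helix", rng))
```

Clamping keeps constrained residues inside their allowed region regardless of what the operators do, which reduces the effective search space for those residues.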

Figure 13. Folding Crambin with Secondary Structure Constraints

The backbone of the predicted conformation (solid line) and that of Crambin (dashed line) differ by an r.m.s.-deviation of only 4.36 Å. For this run only the fitness components Crippen, clash, hydro and scatter were used in the multi-value vector fitness function.

Another run with the same fitness components was performed for trypsin inhibitor (Figure 14). The r.m.s.-deviation to native trypsin inhibitor is 6.65 Å. This is worse than the result for Crambin in Figure 13 because the lower content of secondary structure in trypsin inhibitor implies less rigid constraints on the conformation. This means there are more degrees of freedom and therefore a larger search space to traverse.

Figure 14. Backbone Folding of Trypsin Inhibitor

The backbone of the predicted conformation (solid line) and that of trypsin inhibitor (dashed line) differ by an r.m.s.-deviation of 6.65 Å. For this run only the fitness components Crippen, clash, hydro and scatter were used. The poorer performance of the genetic algorithm compared with the run on Crambin (Figure 13) is a result of the low content of secondary structure in trypsin inhibitor, which increases the number of rotational degrees of freedom.

Summarising these findings and those of the previous subsections we are led to the following conclusions.

  1. Genetic algorithms proved to be an efficient search tool for 3-D representations of proteins. For a 3-D protein model with a simple, additive force field as fitness function and using a rather small population the genetic algorithm produced several individuals (i.e. protein conformations) of dissimilar topology but each with highly optimised fitness values.

  2. Given an appropriate fitness function (for test purposes the r.m.s.-deviation to the a priori known conformation can be used), the genetic algorithm application described in this section finds the desired solution to within small deviations.

  3. The major problem lies in the fitness function. If there were one indicator, or a set of indicators, that returned "1" for "the object is a native protein conformation" and "0" for "the object is not a native protein conformation", one could expect the genetic algorithm approach to deliver reasonably accurate ab initio predictions. However, neither mathematical models nor empirical, semi-empirical or statistical force fields are yet accurate enough to reliably discriminate native from non-native conformations without additional constraints. Thus, the genetic algorithm produces (sub-)optimal conformations in a different sense than that of "nativeness".

  4. Because secondary structure in nature and J. H. Holland's building blocks in the genetic algorithm are analogous fundamental components for the construction of the individual, it was hoped that secondary structures would emerge as the building blocks in a subset of the population. This has not happened so far. One possible explanation is that the fitness functions used are not sensitive enough to detect and account for the structural benefits of secondary structures.
