Protein fragment library

Protein backbone fragment libraries have been used successfully in a variety of structural biology applications, including homology modeling, de novo structure prediction, and structure determination. By reducing the complexity of the search space, these fragment libraries enable more rapid search of conformational space, leading to more efficient and accurate models.

Motivation
Proteins can adopt an exponential number of states when modeled discretely. Typically, a protein's conformation is represented as a set of dihedral angles, bond lengths, and bond angles between all connected atoms. The most common simplification is to assume ideal bond lengths and bond angles. However, this still leaves the phi and psi angles of the backbone, and up to four dihedral angles for each side chain, giving a worst-case complexity of $k^{6n}$ possible states of the protein, where n is the number of residues and k is the number of discrete states modeled for each dihedral angle. To reduce the conformational space, one can use protein fragment libraries rather than explicitly model every phi-psi angle.
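To give a feel for how quickly this worst-case count grows, the following minimal sketch evaluates it for a small peptide. The function name and the example values (10 residues, 3 states per dihedral) are illustrative assumptions, not figures from any particular study.

```python
def discrete_state_count(n_residues, k_states, dihedrals_per_residue=6):
    """Worst-case count of discrete conformations: k ** (d * n).

    dihedrals_per_residue=6 assumes phi, psi, and up to four side-chain
    dihedrals per residue, so the result is an upper bound.
    """
    return k_states ** (dihedrals_per_residue * n_residues)


# Hypothetical example: a 10-residue peptide with 3 states per dihedral.
full = discrete_state_count(10, 3)
backbone_only = discrete_state_count(10, 3, dihedrals_per_residue=2)
print(f"all dihedrals: {full:.2e}, backbone only: {backbone_only:.2e}")
```

Even with side chains ignored, the backbone-only count remains exponential in the number of residues, which is the problem fragment libraries are meant to ease.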

Fragments are short segments of the peptide backbone, typically 5 to 15 residues long, and do not include side chains. In a reduced-atom representation they may specify the positions of only the C-alpha atoms; otherwise they specify all the backbone heavy atoms (N, C-alpha, carbonyl C, O). To model discrete states of a side chain, one could instead use a rotamer library approach.

This approach operates under the assumption that local interactions play a large role in stabilizing the overall protein conformation. In any short sequence, the molecular forces constrain the structure, leading to only a small number of possible conformations, which can be modeled by fragments. Indeed, according to Levinthal's paradox, a protein could not possibly sample all possible conformations within a biologically reasonable amount of time. Locally stabilized structures would reduce the search space and allow proteins to fold on the order of milliseconds.

Construction


Libraries of these fragments are constructed from an analysis of the Protein Data Bank (PDB). First, a representative subset of the PDB is chosen which should cover a diverse array of structures, preferably at a good resolution. Then, for each structure, every set of n consecutive residues is taken as a sample fragment. The samples are then clustered into k groups, based upon how similar they are to each other in spatial configuration, using algorithms such as k-means clustering. The parameters n and k are chosen according to the application (see discussion on complexity below). The centroid of each cluster is then taken as the representative fragment for that cluster. Because a centroid is derived by averaging other geometries, it may not itself have ideal bond geometry, so further optimization can be performed to restore it.
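The construction steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the procedure used by any specific library: each fragment is described by its pairwise C-alpha distances (a hypothetical choice that avoids superposition but cannot distinguish mirror images), and the clustering is a plain Lloyd's-algorithm k-means. All function names are invented for this example.

```python
import math
import random
from itertools import combinations


def sliding_fragments(ca_coords, n):
    """Every window of n consecutive C-alpha positions in one chain."""
    return [ca_coords[i:i + n] for i in range(len(ca_coords) - n + 1)]


def distance_descriptor(fragment):
    """Pairwise C-alpha distances: invariant to rotation and translation."""
    return [math.dist(a, b) for a, b in combinations(fragment, 2)]


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: mean of each cluster (keep old centroid if empty).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Toy chain: a fully extended stretch of 8 C-alpha atoms, 3.8 angstroms apart.
chain = [(3.8 * i, 0.0, 0.0) for i in range(8)]
descriptors = [distance_descriptor(f) for f in sliding_fragments(chain, 5)]
centroids, labels = kmeans(descriptors, k=2)
```

A centroid in descriptor space is not itself a set of coordinates, so in practice one would likely report the sample fragment closest to each centroid, or cluster superposed coordinates directly and then regularize the averaged geometry.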

Because the fragments are derived from structures that exist in nature, the segment of backbone they represent will have realistic bonding geometries. This helps avoid having to explore the full space of conformation angles, much of which would lead to unrealistic geometries.

The clustering above can be performed without regard to the identities of the residues, or it can be residue-specific. That is, for any given input sequence of amino acids, a clustering can be derived using only samples found in the PDB with the same sequence in the n-mer fragment. This requires more computational work than deriving a sequence-independent fragment library but can potentially produce more accurate models. However, a larger sample set is required, and one may not achieve full coverage of all possible sequences.

Example use: loop modeling


In homology modeling, a common application of fragment libraries is to model the loops of the structure. Typically, the alpha helices and beta sheets are threaded against a template structure, but the loops in between are not specified and need to be predicted. Finding the loop with the optimal configuration is NP-hard. To reduce the conformational space that needs to be explored, one can model the loop as a series of overlapping fragments. The space can then be sampled, or if the space is now small enough, exhaustively enumerated.

One approach for exhaustive enumeration goes as follows. Loop construction begins by aligning all possible fragments to overlap with the three residues at the N terminus of the loop (the anchor point). Then all possible choices for a second fragment are aligned to (all possible choices of) the first fragment, ensuring that the last three residues of the first fragment overlap with the first three residues of the second fragment. This ensures that the fragment chain forms realistic angles both within the fragment and between fragments. This is then repeated until a loop with the correct length of residues is constructed.
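The enumeration just described can be sketched as a depth-first search. As a loud simplification, the sketch below represents each fragment as a tuple of per-residue conformational state labels and treats two fragments as compatible when the overlapping three labels match exactly; a real implementation would superpose three-dimensional coordinates and accept overlaps within a tolerance. The function name and the toy library are hypothetical.

```python
def enumerate_loops(library, n_anchor, loop_length, overlap=3):
    """Enumerate all fragment chains grown from the N-terminal anchor.

    library     : list of equal-length tuples of per-residue state labels
    n_anchor    : the `overlap` states of the N-terminal anchor
    loop_length : residues to build, counting the anchor residues
    """
    results = set()

    def extend(chain):
        if len(chain) >= loop_length:
            # Trim any overshoot from the last fragment.
            results.add(tuple(chain[:loop_length]))
            return
        tail = tuple(chain[-overlap:])
        for frag in library:
            # A fragment fits if its first residues coincide with the tail.
            if tuple(frag[:overlap]) == tail:
                extend(chain + list(frag[overlap:]))

    extend(list(n_anchor))
    return sorted(results)


# Toy library of 5-residue fragments over two state labels.
toy_library = [
    ("A", "A", "A", "B", "B"),
    ("A", "B", "B", "B", "A"),
    ("A", "B", "B", "A", "A"),
    ("B", "B", "B", "B", "B"),
]
loops = enumerate_loops(toy_library, ("A", "A", "A"), loop_length=7)
```

Each added fragment contributes its non-overlapping residues only, so a library of 5-residue fragments with an overlap of 3 extends the chain by two residues per step.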

The loop must both begin at the anchor on the N side and end at the anchor on the C side. Each loop must therefore be tested to see if its last few residues overlap with the C-terminal anchor. Very few of the exponentially many candidate loops will close. After filtering out loops that don't close, one must then determine which loop has the optimal configuration, typically the one with the lowest energy under some molecular mechanics force field.
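The closure test and the final selection can be sketched as follows. The sketch assumes each candidate loop is a list of C-alpha coordinates already expressed in the frame of the protein (so no superposition is needed), that closure means the last residues lie within a tolerance of the C-terminal anchor, and that `energy` is a caller-supplied scoring function standing in for a force field. The 1.0 angstrom tolerance and all names are illustrative assumptions.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length coordinate lists."""
    squared = [math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b)]
    return math.sqrt(sum(squared) / len(squared))


def closed_loops(candidates, c_anchor, tolerance=1.0):
    """Keep loops whose final residues coincide with the C-terminal anchor."""
    m = len(c_anchor)
    return [loop for loop in candidates if rmsd(loop[-m:], c_anchor) <= tolerance]


def best_loop(candidates, c_anchor, energy, tolerance=1.0):
    """Lowest-energy loop among those that close, or None if none close."""
    closed = closed_loops(candidates, c_anchor, tolerance)
    return min(closed, key=energy) if closed else None
```

Filtering before scoring matters because the energy evaluation is usually the expensive step, and most candidates are discarded by the cheap geometric test.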

Complexity
The complexity of the state space is still exponential in the number of residues, even after using fragment libraries. However, the exponent is reduced. For a library of L fragments, each F residues long, modeling a chain of N residues with consecutive fragments overlapping by 3 residues gives $L^{[N/(F-3)]+1}$ possible chains. This is much less than the $K^N$ possibilities that arise when the phi-psi angles of each residue are modeled explicitly as K possible combinations, because the exponent grows more slowly than N.
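A short calculation makes the comparison concrete. The sketch below reads the square brackets in the exponent as the integer part of the quotient, which is an assumption about the notation, and the parameter values (a 12-residue loop, 9-residue fragments, a library of 100 fragments, and 36 phi-psi combinations per residue) are hypothetical.

```python
def fragment_chain_count(n_residues, fragment_len, library_size, overlap=3):
    """L ** ([N / (F - overlap)] + 1), reading [x] as the integer part of x."""
    if fragment_len <= overlap:
        raise ValueError("fragment length must exceed the overlap")
    steps = n_residues // (fragment_len - overlap)
    return library_size ** (steps + 1)


def explicit_count(n_residues, k_combinations):
    """K ** N: every residue independently takes one of K phi-psi states."""
    return k_combinations ** n_residues


# Hypothetical parameters, chosen only for illustration.
with_fragments = fragment_chain_count(12, 9, 100)
without_fragments = explicit_count(12, 36)
print(f"fragments: {with_fragments:.1e}, explicit: {without_fragments:.1e}")
```

The saving comes from the exponent: each fragment fixes F - 3 new residues at once, so the number of independent choices falls from N to roughly N / (F - 3).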

The complexity increases with L, the size of the fragment library. However, libraries with more fragments capture a greater diversity of fragment structures, so there is a trade-off between the accuracy of the model and the speed of exploring the search space. This choice governs the number of clusters k used when performing the clustering, since each cluster contributes one fragment to the library.

Additionally, for any fixed L, the diversity of structures capable of being modeled decreases as the length of the fragments increases. Shorter fragments are more capable of covering the diverse array of structures found in the PDB than longer ones. It has been shown that libraries of fragments up to 15 residues long are capable of modeling 91% of the fragments in the PDB to within 2.0 angstroms.