Protein I-sites

I-sites are short sequence-structure motifs, mined from the Protein Data Bank (PDB), that correlate strongly with three-dimensional structural elements. These motifs are used for local structure prediction of proteins. Local structure can be expressed as fragments or as backbone angles. Locations in a protein sequence that have high-confidence I-site predictions may be initiation sites of folding. I-sites have also been identified as discrete models for folding pathways. The I-sites library consists of about 250 motifs. Each motif has an amino acid profile, a fragment structure (represented by a "paradigm" fragment chosen from a protein in the PDB) and, optionally, a 4-dimensional tensor of pairwise sequence covariance.

Construction of I-site Library
The sequence and structure database

The database initially consisted of 471 protein sequence families from the HSSP database, with an average of 47 aligned sequences per family. Each family contained a single known structure (the parent) from the Brookhaven Protein Data Bank. These families were a subset of the PDBSelect-25 list, with no more than 25% sequence identity between any two alignments. Disordered loops were omitted. Gaps and insertions in the sequence were ignored.

Clustering of sequence segments

Each position in the database is described by a weighted amino acid frequency. A similarity measure in sequence space between a segment (p) and a cluster of segments (q) is defined as:

$$D_{pq} = \sum_{ij} \log \left[ \dfrac{P_{ij}(p)+\alpha F_i}{(1+\alpha) F_i} \right] \log \left[ \dfrac{\sum_{k \in q} P_{ij}(k)+ \alpha' F_i}{(N_q + \alpha') F_i} \right]$$

where $P_{ij}(p)$ is the frequency of amino acid $i$ in position $j$ within the segment $p$, $N_q$ is the number of sequence segments $k$ in the cluster $q$, and $F_i$ is the frequency of amino acid type $i$ in the database overall. The optimal values of $\alpha$ and $\alpha'$ were determined empirically to be 0.5 and 15, respectively. Using this similarity measure, segments of a given length (3 to 15) were clustered via the k-means algorithm.
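The similarity measure can be sketched in a few lines of Python. This is only an illustration of the formula, not the original implementation: the function name `similarity`, the nested-list layout of the profiles (indexed as `profile[i][j]` for amino acid i and position j) and the uniform background used in the usage note are all assumptions made for the example.

```python
import math

def similarity(p_profile, cluster_profiles, background,
               alpha=0.5, alpha_prime=15.0):
    """Sketch of D_pq for one segment p against a cluster q.

    p_profile        -- P_ij(p), indexed [amino acid i][position j]
    cluster_profiles -- list of such profiles, one per segment k in q
    background       -- F_i, overall frequency of each amino acid type
    (The data layout is an assumption made for this example.)
    """
    n_q = len(cluster_profiles)
    total = 0.0
    for i, f_i in enumerate(background):        # amino acid type i
        for j in range(len(p_profile[i])):      # position j in the segment
            # Log-odds of the segment's own (pseudocount-smoothed) frequency.
            seg = math.log((p_profile[i][j] + alpha * f_i)
                           / ((1.0 + alpha) * f_i))
            # Log-odds of the cluster's summed, smoothed frequency.
            cluster_sum = sum(k[i][j] for k in cluster_profiles)
            clu = math.log((cluster_sum + alpha_prime * f_i)
                           / ((n_q + alpha_prime) * f_i))
            total += seg * clu
    return total
```

With a uniform background, a segment whose profile equals the background scores approximately zero against any cluster, because its first logarithm vanishes at every position. A segment that prefers the same residues as the cluster scores higher than one that prefers different residues.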

Assessing structure within a cluster; choice of paradigm

The structural similarity between any two peptide segments was evaluated using a combination of the RMS distance matrix error (dme):

$$dme = \sqrt{ \dfrac{\sum\limits_{i=1}^L \sum\limits_{j=i-5}^{i+5} ( \alpha_{i \rightarrow j}^{s1} - \alpha_{i \rightarrow j}^{s2} )^2 }{N} }$$

where $\alpha_{i \rightarrow j}^{s1}$ is the distance between α-carbon atoms $i$ and $j$ in the segment s1 of length $L$, and the maximum deviation in backbone torsion angles (mda) over the length of the segment, given by:

$$mda(L) = \max_{i=1,\ldots,L-1} (\Delta \Phi_{i+1}, \Delta \Psi_i)$$
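Both measures are simple to compute from α-carbon coordinates and backbone torsion angles. The sketch below is illustrative only and rests on several assumptions that the text does not state: N is taken to be the number of (i, j) pairs actually summed, pairs falling outside the segment and the trivial pair j = i are skipped, angles are in degrees, and angle differences are wrapped to the range 0 to 180. The function names are hypothetical.

```python
import math

def dme(ca1, ca2, window=5):
    """RMS distance matrix error between two equal-length segments.

    ca1, ca2 -- lists of (x, y, z) alpha-carbon coordinates.
    Assumption: N counts the in-range pairs with j != i.
    """
    if len(ca1) != len(ca2):
        raise ValueError("segments must have the same length")
    length = len(ca1)
    sq_sum, n = 0.0, 0
    for i in range(length):
        for j in range(max(0, i - window), min(length, i + window + 1)):
            if j == i:
                continue
            d1 = math.dist(ca1[i], ca1[j])   # distance i -> j in segment s1
            d2 = math.dist(ca2[i], ca2[j])   # same pair in segment s2
            sq_sum += (d1 - d2) ** 2
            n += 1
    return math.sqrt(sq_sum / n) if n else 0.0

def angle_diff(a, b):
    """Absolute difference of two angles in degrees, wrapped to [0, 180]."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def mda(phi1, psi1, phi2, psi2):
    """Maximum deviation over delta-phi(i+1) and delta-psi(i), i = 1..L-1."""
    length = len(phi1)
    return max(
        max(angle_diff(phi1[i + 1], phi2[i + 1]),
            angle_diff(psi1[i], psi2[i]))
        for i in range(length - 1)
    )
```

Note that, following the index range in the formula, the first φ and the last ψ of the segment do not contribute to mda.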

The paradigm structure for a cluster was chosen from the 20 top-scoring segments in the database as the one with the smallest sum of mda values to the other 19. Other structural measures were tried before settling on these two: RMS deviation of α-carbon atoms (rmsd), dme alone, and a structural filter that looked for specific conserved contacts. The conserved-contacts filter worked best in discriminating true and false positives, but could not be easily automated. The rmsd and dme were found to be poor discriminators of the two types of helix cap. The combined mda-dme filter best simulates the conserved-contacts filter and is rapidly computed.
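Choosing the paradigm in this way amounts to picking the medoid of the top-scoring segments under the mda measure. A minimal sketch follows, assuming the candidate segments and a pairwise distance function are already available; `choose_paradigm` and its arguments are hypothetical names, and ties are broken here by taking the first candidate, which the text does not specify.

```python
def choose_paradigm(segments, distance):
    """Return the index of the segment with the smallest summed
    distance to all other candidates (the medoid).

    segments -- candidate segments, e.g. the top-scoring 20
    distance -- function(seg_a, seg_b) -> float, e.g. an mda measure
    """
    best_index, best_total = None, float("inf")
    for a, seg_a in enumerate(segments):
        total = sum(distance(seg_a, seg_b)
                    for b, seg_b in enumerate(segments) if b != a)
        if total < best_total:               # first minimum wins on ties
            best_index, best_total = a, total
    return best_index
```

In use, `distance` would wrap an mda computation on the torsion angles of the two segments; any symmetric dissimilarity can be substituted to experiment with the other measures mentioned above.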