User:IoChris/sandbox

= Bio3D - an R package for protein structure analysis =

Bio3D is a platform-independent R package that provides interactive tools for the analysis of biomolecular structure, sequence and simulation data, derived mainly from the Protein Data Bank (PDB). It is designed to manipulate protein structure files and includes tools covering several areas of analysis, from the initial exploration and manipulation of protein structures to basic molecular dynamics trajectory analysis, principal component analysis of multiple protein structures, normal mode analysis and more.

Working with multiple PDB structures
Bio3D was designed to facilitate the analysis of multiple protein structure files. The challenge of working with these structures is that they usually differ in composition (e.g. different numbers of atoms, inconsistent residue numbering, missing atoms, differing conformations), and it is these differences that are frequently of most interest.

To this end, Bio3D contains extensive utilities for reading and writing sequence and structure data, sequence and structure alignment, homologous protein searches, structure annotation, atom selection, re-orientation, superposition, rigid core identification, clustering, torsion analysis, distance matrix analysis, structure and sequence conservation analysis, normal mode analysis across related structures, and principal component analysis of structural ensembles.

Constructing experimental structure ensembles for a protein family
It is common in biological research to study a protein by identifying its closest relatives (homologs). Comparing multiple structures of homologous proteins and carefully analyzing large multiple sequence alignments can help identify patterns of sequence and structural conservation and highlight conserved interactions that are crucial for protein stability and function. Bio3D provides a useful framework for such studies and can facilitate the integration of sequence, structure and dynamics data in the analysis of protein evolution.

Identification of similar structures
Similar protein structures are searched for and retrieved with the help of BLAST. The sequence of the protein of interest is searched against the online NCBI BLAST database, and the results are returned as a list of hits with their PDB IDs and respective scores. The hits can be annotated on demand with a single command by specifying the fields of interest (e.g. resolution, ligandId, citation).

Multiple Sequence Alignment
The hits are downloaded from the PDB and split into separate chains. A dedicated Bio3D function then extracts the amino acid sequence from each structure and uses it for a multiple sequence alignment, which determines the residue-to-residue correspondences. The function relies on the alignment program MUSCLE, which must be installed separately; alternatively, the alignment can be run through an online web server. The generated alignment file can be inspected with an alignment viewer.
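How an alignment yields residue-to-residue correspondences can be sketched in a few lines. The example below is an illustration in Python, not Bio3D's R API: the function names `column_to_residue` and `gap_free_columns` and the three-sequence alignment are hypothetical, and `-` is assumed to be the gap character.

```python
def column_to_residue(aligned_seq, gap="-"):
    """Map each alignment column to a 0-based residue index, or None at a gap."""
    mapping, residue = [], 0
    for char in aligned_seq:
        if char == gap:
            mapping.append(None)
        else:
            mapping.append(residue)
            residue += 1
    return mapping


def gap_free_columns(alignment, gap="-"):
    """Return the columns in which every sequence has a residue."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must all have the same length")
    return [c for c in range(length) if all(seq[c] != gap for seq in alignment)]


# Hypothetical three-sequence alignment, not taken from any real protein.
alignment = ["MK-LV", "MKTLV", "M-TLV"]
columns = gap_free_columns(alignment)

# For each sequence, the residue indices that have an equivalent in every structure.
equivalent = [[column_to_residue(seq)[c] for c in columns] for seq in alignment]
```

Only the gap-free columns give residues that can be compared across all structures, which is why later steps such as superposition and fluctuation analysis are restricted to them.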

Comparative Structure Analysis
The detailed comparison of homologous protein structures can be used to infer pathways for evolutionary adaptation and, at closer evolutionary distances, mechanisms for conformational change. The Bio3D package employs both conventional methods for structural analysis (alignment, RMSD, difference distance matrix analysis, etc.) and refined structural superposition and principal component analysis (PCA) to facilitate comparative structure analysis.

Structure Superposition
Conventional structural superposition of proteins minimizes the root mean square deviation over their full set of equivalent residues, and Bio3D provides functions for this. For certain applications, however, such a procedure is inappropriate. For example, when comparing a multi-domain protein that has undergone a hinge-like rearrangement of its domains, standard all-atom superposition attempts to fit all domains at once (whole-structure superposition) and therefore underestimates the true atomic displacement. A more appropriate and insightful superposition is anchored at the most invariant region and hence highlights the domain rearrangement more clearly (sub-structure superposition).
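The difference between whole-structure and sub-structure superposition can be illustrated with a deliberately simplified sketch. It works in two dimensions, where the least-squares rotation has a closed form, rather than in 3D as a real structure fit would. It is written in Python for illustration and is not Bio3D's implementation; `fit_2d`, `displacements` and the seven-atom hinge model are hypothetical.

```python
import math


def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def fit_2d(mobile, target, fit_indices):
    """Superpose 2D `mobile` onto `target` using only the atoms in `fit_indices`.

    Returns all mobile atoms after the least-squares rotation and translation.
    """
    fit_indices = list(fit_indices)
    cm = centroid([mobile[i] for i in fit_indices])
    ct = centroid([target[i] for i in fit_indices])
    dot = cross = 0.0
    for i in fit_indices:
        px, py = mobile[i][0] - cm[0], mobile[i][1] - cm[1]
        qx, qy = target[i][0] - ct[0], target[i][1] - ct[1]
        dot += px * qx + py * qy
        cross += px * qy - py * qx
    theta = math.atan2(cross, dot)  # optimal rotation angle (2D only)
    c, s = math.cos(theta), math.sin(theta)
    fitted = []
    for x, y in mobile:
        x, y = x - cm[0], y - cm[1]
        fitted.append((c * x - s * y + ct[0], s * x + c * y + ct[1]))
    return fitted


def displacements(a, b):
    """Per-atom distance between two equally sized 2D coordinate sets."""
    return [math.hypot(p[0] - q[0], p[1] - q[1]) for p, q in zip(a, b)]


# Toy two-domain model: atoms 0-3 are rigid, atoms 4-6 swing by 90 degrees about atom 3.
target = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0)]
mobile = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]

whole = displacements(fit_2d(mobile, target, range(7)), target)  # fit on all atoms
core = displacements(fit_2d(mobile, target, range(4)), target)   # fit on rigid domain only
```

In this toy model the whole-structure fit spreads the error over both domains, so the tip of the moving domain appears displaced far less than it is when the fit is anchored on the rigid domain.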

Bio3D's rigid core identification function implements an iterated superposition procedure, in which the residues displaying the largest positional differences are identified and excluded at each round. The function returns an ordered list of excluded residues, from which the user can select a subset of core residues on which to base the superposition.

Standard Structural Analysis
Bio3D contains functions to perform standard structural analyses, such as root mean square deviation (RMSD), root mean square fluctuation (RMSF), secondary structure, dihedral angles and difference distance matrices.

Root mean square deviation (RMSD)
RMSD is a standard measure of structural distance between coordinate sets. Pairwise RMSD values can be examined and used to cluster the available structures. The results can be displayed as a histogram of RMSD values and as a dendrogram that illustrates the clustering of the structures.
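As a minimal sketch of the quantity involved, the following Python code computes a pairwise RMSD matrix for coordinate sets that are assumed to be already superposed and to contain the same equivalent atoms. It is an illustration rather than Bio3D's R function; the names `rmsd` and `pairwise_rmsd` and the toy coordinates are hypothetical.

```python
import math


def rmsd(a, b):
    """RMSD between two equally sized lists of (x, y, z) points, without fitting."""
    if len(a) != len(b):
        raise ValueError("coordinate sets must have the same number of atoms")
    squared = sum((p - q) ** 2 for pa, pb in zip(a, b) for p, q in zip(pa, pb))
    return math.sqrt(squared / len(a))


def pairwise_rmsd(structures):
    """Symmetric matrix of RMSD values between all pairs of structures."""
    n = len(structures)
    return [[rmsd(structures[i], structures[j]) for j in range(n)] for i in range(n)]


# Three toy two-atom "structures" (hypothetical coordinates).
structures = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)],
]
rmsd_matrix = pairwise_rmsd(structures)
```

A matrix of this kind is the usual input for hierarchical clustering, from which a dendrogram can be drawn.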

Root mean square fluctuation (RMSF)
RMSF is another frequently used measure of conformational variance. Bio3D's RMSF function returns a vector of atom-wise (or residue-wise) fluctuation values rather than a single number. Bio3D can detect gap-containing positions and exclude them from the subsequent RMSF calculation. The RMSF values can then be plotted against residue position.
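The per-position calculation, including the exclusion of gap positions, might look like the sketch below. It is written in Python for illustration and makes several assumptions that are not from the source: gaps are represented as `None`, the structures are already superposed, and the mean square fluctuation is normalized by the number of structures (a package may normalize differently, for example by one less than that number).

```python
import math


def rmsf(ensemble):
    """Per-position RMSF over an ensemble of superposed structures.

    Each structure is a list of (x, y, z) tuples, with None marking a gap.
    Positions with a gap in any structure are reported as None.
    """
    result = []
    for i in range(len(ensemble[0])):
        column = [structure[i] for structure in ensemble]
        if any(atom is None for atom in column):
            result.append(None)  # gap-containing position: excluded
            continue
        mean = [sum(atom[k] for atom in column) / len(column) for k in range(3)]
        msf = sum(
            sum((atom[k] - mean[k]) ** 2 for k in range(3)) for atom in column
        ) / len(column)
        result.append(math.sqrt(msf))
    return result


# Two toy structures with three aligned positions (hypothetical data).
ensemble = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), None],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
]
fluctuations = rmsf(ensemble)
```

The result has one entry per aligned position, so it can be plotted directly against residue position with the gap positions left blank.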

Torsion/Dihedral analysis
The conformation of a polypeptide or nucleotide chain can be usefully described in terms of the angles of internal rotation around its constituent bonds. If a system of four atoms (A-B-C-D) is projected onto a plane normal to bond B-C, the angle between the projection of A-B and the projection of C-D is the torsion angle of A and D about bond B-C. By convention, angles are measured in the range -180° to +180°, rather than from 0° to 360°, with positive values corresponding to a clockwise rotation when viewed along the B-C bond. The results of Bio3D's torsion analysis function can be displayed as a basic Ramachandran plot.
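The definition above translates into a short calculation using cross and dot products. The sketch below is in Python for illustration, not Bio3D's R function; the name `torsion` and the test coordinates are hypothetical. It uses the common `atan2` formulation, which returns a signed angle in the -180° to +180° range.

```python
import math


def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross(u, v):
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def torsion(a, b, c, d):
    """Signed torsion angle A-B-C-D in degrees, in the range -180 to +180."""
    b1, b2, b3 = sub(b, a), sub(c, b), sub(d, c)
    n1 = cross(b1, b2)  # normal of the plane through A, B, C
    n2 = cross(b2, b3)  # normal of the plane through B, C, D
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Four atoms with the central bond B-C along the z axis (hypothetical coordinates).
a, b, c = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
quarter_turn = torsion(a, b, c, (0.0, 1.0, 1.0))
opposite_turn = torsion(a, b, c, (0.0, -1.0, 1.0))
cis = torsion(a, b, c, (1.0, 0.0, 1.0))
trans = torsion(a, b, c, (-1.0, 0.0, 1.0))
```

Applied to consecutive backbone atoms, the same function gives the phi and psi angles that are plotted against each other in a Ramachandran plot.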

Difference distance matrix analysis (DDM)
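Distance matrices, also called distance plots or distance maps, are an established means of describing and comparing protein conformations. A distance matrix is a 2D representation of a 3D structure that is independent of the coordinate reference frame and, ignoring chirality, contains enough information to reconstruct the 3D Cartesian coordinates. Subtracting the distance matrix of one conformation from that of another, over equivalent atoms, gives a difference distance matrix that highlights the regions whose relative positions change. A contact map is essentially a simplified distance matrix in which each distance is reduced to a binary contact or no-contact value by a distance cutoff.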

Bio3D provides one function for calculating distance matrices and another for calculating contact maps.
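The three objects discussed in this section can be sketched in a few lines. The Python code below is an illustration rather than Bio3D's R API; the function names, the toy coordinates and the 4.5 cutoff are hypothetical choices, and `math.dist` requires Python 3.8 or later.

```python
import math


def distance_matrix(coords):
    """All-against-all Euclidean distances between (x, y, z) points."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def contact_map(dmat, cutoff):
    """Binary matrix: 1 where the distance is within the cutoff, else 0."""
    return [[1 if d <= cutoff else 0 for d in row] for row in dmat]


def difference_distance_matrix(dm_a, dm_b):
    """Element-wise difference between two distance matrices of equal size."""
    return [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(dm_a, dm_b)]


# Two toy conformations of the same three atoms (hypothetical coordinates).
conf_a = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
conf_b = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (7.0, 0.0, 0.0)]

dm_a = distance_matrix(conf_a)
dm_b = distance_matrix(conf_b)
ddm = difference_distance_matrix(dm_a, dm_b)
contacts_a = contact_map(dm_a, cutoff=4.5)
```

Because only internal distances are used, no superposition is needed before comparing the two conformations. In this simple sketch the diagonal of the contact map is trivially 1, since every atom is at distance zero from itself.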

Principal Component Analysis (PCA)
After the identification of core residues and the subsequent superposition, PCA can be employed to examine the relationship between different structures based on their equivalent residues. PCA can be applied to both distributions of experimental structures and molecular dynamics trajectories and it can additionally provide considerable insight into the nature of conformational differences.

The resulting principal components (orthogonal eigenvectors) describe the axes of maximal variance of the distribution of structures. Projecting the distribution onto the subspace defined by the largest principal components gives a lower-dimensional representation of the structural dataset. The percentage of the total mean square displacement (or variance) of atom positional fluctuations captured in each dimension is given by the corresponding eigenvalue. Experience suggests that 3–5 dimensions are often sufficient to capture over 70 percent of the total variance in a given family of structures. Thus, a handful of principal components can provide a useful description while still retaining most of the variance in the original distribution. Bio3D's PCA function performs this analysis on the available structures.
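The relationship between eigenvectors, eigenvalues and captured variance can be made concrete with a small sketch. The Python code below is not how Bio3D computes PCA. To stay dependency-free it swaps in power iteration with deflation, whereas a numerical package would normally use a full eigendecomposition. The function name `pca`, its parameters and the toy ensemble are hypothetical, and the input is assumed to be superposed structures flattened to one row of coordinates each.

```python
import math


def pca(ensemble, n_components=2, iterations=500):
    """PCA of rows of flattened, superposed coordinates.

    Returns (mean, eigenvalues, eigenvectors, variance_fractions, scores).
    """
    m, dim = len(ensemble), len(ensemble[0])
    if m < 2:
        raise ValueError("need at least two structures")
    mean = [sum(row[k] for row in ensemble) / m for k in range(dim)]
    centered = [[row[k] - mean[k] for k in range(dim)] for row in ensemble]
    # Sample covariance matrix of the coordinates (dim x dim).
    cov = [[sum(r[i] * r[j] for r in centered) / (m - 1) for j in range(dim)]
           for i in range(dim)]
    total_variance = sum(cov[k][k] for k in range(dim))

    work = [row[:] for row in cov]
    eigenvalues, eigenvectors = [], []
    for _component in range(n_components):
        # Fixed, non-uniform start vector keeps the sketch deterministic.
        vec = [1.0 / (k + 1) for k in range(dim)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _step in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:  # no variance left in the deflated matrix
                break
            vec = [x / norm for x in nxt]
        # The Rayleigh quotient of the unit vector gives its eigenvalue.
        value = sum(vec[i] * sum(work[i][j] * vec[j] for j in range(dim))
                    for i in range(dim))
        eigenvalues.append(value)
        eigenvectors.append(vec)
        # Deflation: remove the component just found before the next round.
        for i in range(dim):
            for j in range(dim):
                work[i][j] -= value * vec[i] * vec[j]

    fractions = [v / total_variance for v in eigenvalues]
    scores = [[sum(r[k] * vec[k] for k in range(dim)) for vec in eigenvectors]
              for r in centered]
    return mean, eigenvalues, eigenvectors, fractions, scores


# Four toy structures of two atoms each, flattened to six coordinates (hypothetical).
ensemble = [
    [-2.0, 0.0, 0.0, 5.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0, 5.0, -1.0, 0.0],
    [1.0, 0.0, 0.0, 5.0, -1.0, 0.0],
    [2.0, 0.0, 0.0, 5.0, 1.0, 0.0],
]
mean, eigenvalues, eigenvectors, fractions, scores = pca(ensemble)
```

Each entry of `fractions` is an eigenvalue divided by the total variance, which is the per-component percentage described above, and `scores` holds the projection of every structure onto the retained components.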

It is possible to create plots with the pairwise comparison of principal components or plot each component separately across the amino acid residue sequence and visualize each residue's contribution to the principal component(s).

To further aid interpretation, a PDB format trajectory can be produced that interpolates between the most dissimilar structures in the distribution along a given principal component. This involves dividing the difference between the conformers into a number of evenly spaced steps along the principal component, which form the frames of the trajectory. Such trajectories can be visualized directly in a molecular graphics program such as VMD. Furthermore, the PCA results can be compared to those from simulations, used to guide dynamic network analysis, analyzed for possible dynamic domains, or used to provide initial seed structures for reaction path refinement methods such as Conjugate Peak Refinement.
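The interpolation step can be sketched as follows, assuming the mean structure, one principal component and the scores of all structures along it are already available as flat lists. The Python code is illustrative only and does not write PDB files; `pc_trajectory` and the toy values are hypothetical.

```python
def pc_trajectory(mean, component, scores, n_frames=5):
    """Interpolate along one principal component between the extreme scores.

    Each frame is the mean structure displaced along `component` by an
    amplitude evenly spaced between the smallest and largest score.
    """
    if n_frames < 2:
        raise ValueError("need at least two frames")
    low, high = min(scores), max(scores)
    step = (high - low) / (n_frames - 1)
    frames = []
    for t in range(n_frames):
        amplitude = low + t * step
        frames.append([m + amplitude * c for m, c in zip(mean, component)])
    return frames


# Toy one-atom example: the component points along x (hypothetical values).
mean = [0.0, 1.0, 2.0]
component = [1.0, 0.0, 0.0]
scores = [-2.0, -1.0, 1.0, 2.0]
frames = pc_trajectory(mean, component, scores, n_frames=5)
```

The first and last frames correspond to the two most dissimilar structures along the chosen component, and the intermediate frames are the evenly spaced steps between them.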

Conformer Clustering in PC Space
Clustering structures in PC space makes it possible to focus on the relationships between individual structures in terms of their major structural displacements, with a controllable level of dynamic detail (set by the number of principal components used in the clustering). For example, we can investigate how the X-ray structures of our protein of interest relate to the rest with respect to the major conformational change that covers over 65% of their structural variance. This can reveal functional relationships that are often hard to find with conventional pairwise methods such as the RMSD clustering described previously.