User:Dhruv Ch/Protein databases

= Protein Databases =

Protein databases have become a crucial part of modern biology. Huge amounts of data for protein structures, functions, and particularly sequences are being generated. Searching databases is often the first step in the study of a new protein. Comparison between proteins or between protein families provides information about the relationship between proteins within a genome or across different species, and hence offers much more information than can be obtained by studying only an isolated protein. In addition, secondary databases derived from experimental databases are also widely available. These databases reorganize and annotate the data or provide predictions. The use of multiple databases often helps researchers understand the structure and function of a protein. Although some protein databases are widely known, they are far from being fully utilized in the protein science community. This unit provides a starting point for readers to explore the potential of protein databases on the Internet.

Keywords: Bioinformatics, Biological Databases, Protein Analysis, Protein Modeling

Introduction
Protein databases have become a crucial part of modern biology. Huge amounts of data for protein structures, functions, and particularly sequences are being generated. These data cannot be handled without using computer databases. Searching databases is often the first step in the study of a new protein. Without the prior knowledge obtained from such searches, known information about the protein could be missed, or an experiment could be repeated unnecessarily. Comparison between proteins and protein classification provide information about the relationship between proteins within a genome or across different species, and hence offer much more information than can be obtained by studying only an isolated protein. In this sense, protein comparison through databases allows one to view life as a forest instead of individual trees. In addition, secondary databases derived from experimental databases are also widely available. These databases reorganize and annotate the data or provide predictions. The use of multiple databases often helps researchers understand evolution, structure, and function of a protein.

Protein databases owe much of their power to the Internet. Unlike traditional media, such as the CD-ROM, the Internet allows databases to be easily maintained and frequently updated at minimal cost. Researchers with limited resources can afford to set up their own databases and disseminate their data quickly. Notably, many small databases on specific types of proteins, such as the EF-Hand Calcium-Binding Proteins Data Library (http://structbio.vanderbilt.edu/cabp_database/), are widely available. Users worldwide can easily access the most up-to-date version through a user-friendly interface. Most protein databases have interactive search engines so that users can specify their needs and obtain the related information interactively. Many protein databases also allow submitters to deposit data, and database servers can check the format of the data and provide immediate feedback.

Although some protein databases are widely known, they are far from being fully utilized in the protein science community. This unit provides a starting point for readers to explore the potential of protein databases on the Internet. Databases for different aspects of proteins are discussed with the focus on sequence, structure, and family. The strengths and weaknesses of the databases are addressed. For Web addresses of the databases discussed in this unit, see Internet Resources and Table 19.4.1. From hundreds of on-line protein databases, several major databases are discussed as examples to illustrate their features and how they can be used effectively. Most other protein databases can be explored in a similar way.<ref>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3265122/</ref>

Protein Database
A protein structure database is a database that is modeled around the various experimentally determined protein structures. The aim of most protein structure databases is to organize and annotate the protein structures, providing the biological community access to the experimental data in a useful way. Data included in protein structure databases often include three-dimensional coordinates as well as experimental information, such as unit cell dimensions and angles for structures determined by X-ray crystallography. Most entries, whether proteins or specific structure determinations of a protein, also contain sequence information, and some databases even provide means for performing sequence-based queries. Nevertheless, the primary attribute of a structure database is structural information, whereas sequence databases focus on sequence information and contain no structural information for the majority of entries. Protein structure databases are critical for many efforts in computational biology, such as structure-based drug design, both in developing the computational methods used and in providing a large experimental dataset used by some methods to provide insights about the function of a protein.
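To make the idea of "three-dimensional coordinates plus experimental information" concrete, the following minimal Python sketch reads the unit cell and one atom position from two records laid out in the fixed-column legacy PDB text format (CRYST1 and ATOM records). The numeric values are invented for illustration and do not come from any real entry; the helper names `parse_cryst1` and `parse_atom` are hypothetical.

```python
def parse_cryst1(line: str) -> dict:
    """Read unit cell lengths (angstroms), angles (degrees) and space group from a CRYST1 record."""
    return {
        "a": float(line[6:15]),
        "b": float(line[15:24]),
        "c": float(line[24:33]),
        "alpha": float(line[33:40]),
        "beta": float(line[40:47]),
        "gamma": float(line[47:54]),
        "space_group": line[55:66].strip(),
    }


def parse_atom(line: str) -> dict:
    """Read the atom name, residue, chain and Cartesian coordinates from an ATOM record."""
    return {
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


# Build the two example records with format strings so the columns line up exactly.
# All values below are made up for this sketch.
cryst1_line = "CRYST1%9.3f%9.3f%9.3f%7.2f%7.2f%7.2f %-11s%4d" % (
    52.0, 58.6, 61.9, 90.0, 90.0, 90.0, "P 21 21 21", 8)
atom_line = "ATOM  %5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f%6.2f%6.2f" % (
    1, " CA ", " ", "ALA", "A", 1, " ", 11.104, 6.134, -6.504, 1.00, 20.00)

cell = parse_cryst1(cryst1_line)
atom = parse_atom(atom_line)
print(cell)
print(atom)
```

A sequence database record would carry only the residue names in order; the coordinates and cell parameters are what distinguish a structure entry.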

The Protein Data Bank
The Protein Data Bank (PDB) was established in 1971 as the central archive of all experimentally determined protein structure data. Today the PDB is maintained by an international consortium known as the Worldwide Protein Data Bank (wwPDB). The mission of the wwPDB is to maintain a single archive of macromolecular structural data that is freely and publicly available to the global community.

Database of Macromolecular Movements
The Database of Macromolecular Motions (molmovdb) is a bioinformatics database and software-as-a-service tool that attempts to categorize macromolecular motions, sometimes also known as conformational change. It was originally developed by Mark B. Gerstein, Werner Krebs, and Nat Echols in the Molecular Biophysics & Biochemistry Department at Yale University.

It attempts to systematize all instances of protein and nucleic acid movement for which there is at least some structural information. At present it contains >120 motions, most of which are of proteins. The database contains plausible representations for motion pathways, derived from restrained 3D interpolation between known endpoint conformations. These pathways can be viewed in a variety of movie formats, and the database is associated with a server that can automatically generate these movies from submitted coordinates.
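As a rough illustration of generating a motion pathway between two known endpoint conformations, the Python sketch below performs plain linear interpolation of atomic coordinates. This is a simplification and not the database's own procedure: the restraints mentioned above are omitted, and the function name and the two-atom example coordinates are assumptions made for this sketch.

```python
from typing import List, Tuple

Coord = Tuple[float, float, float]


def interpolate_conformations(start: List[Coord], end: List[Coord],
                              n_frames: int) -> List[List[Coord]]:
    """Return n_frames coordinate sets that move linearly from start to end (both included)."""
    if len(start) != len(end):
        raise ValueError("both conformations must list the same atoms in the same order")
    if n_frames < 2:
        raise ValueError("need at least two frames (the two endpoints)")
    frames = []
    for i in range(n_frames):
        t = i / (n_frames - 1)  # 0.0 at the first endpoint, 1.0 at the second
        frame = [tuple(a + t * (b - a) for a, b in zip(p, q))
                 for p, q in zip(start, end)]
        frames.append(frame)
    return frames


# Two hypothetical endpoint conformations of a two-atom fragment.
open_state = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
closed_state = [(0.0, 0.0, 0.0), (0.0, 3.8, 0.0)]

frames = interpolate_conformations(open_state, closed_state, 5)
for frame in frames:
    print(frame)
```

Unrestrained linear interpolation can pass through physically implausible intermediates (for example, shortened bond lengths in the middle frames), which is one reason a restrained scheme is preferable for plausible pathways.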

Dynameomics
Dynameomics is a continuing project in the Daggett group to characterize the native-state dynamics and the folding/unfolding pathways of representatives from all known protein folds by molecular dynamics simulation. It holds molecular dynamics simulations of the native state and unfolding pathways of over 2000 protein/peptide systems (approximately 11,000 independent simulations) representing the majority of folds in globular proteins. These data are stored and organized so that they can be mined to obtain both general and specific information about the dynamics and folding/unfolding of proteins, relevant subsets thereof, and individual proteins.

JenaLib
The Jena Library of Biological Macromolecules (JenaLib) is aimed at a better dissemination of information on three-dimensional biopolymer structures, with an emphasis on visualization and analysis. It provides access to all structure entries deposited at the Protein Data Bank (PDB) or at the Nucleic Acid Database (NDB). In addition, basic information on the architecture of biopolymer structures is available. Its features include:

(1) Atlas pages and entry lists.

(2) PDB sequence information extracted from atomic coordinates.

(3) PDB/UniProt sequence alignments that clearly indicate gaps, mutations, numbering irregularities and modified residues.

(4) Integration of data on single amino acid polymorphisms (SAPs), PROSITE motifs, exon structure and SCOP/CATH/Pfam domains with PDB, GO and taxonomy information.

(5) Display of these data in the sequence/alignment viewer and in the Jmol-based molecule viewer Jena3D, in the latter case both for asymmetric and biological units.

(6) A QuickSearch option that allows searching for PDB/NDB code, UniProt ID/accession number and other search terms in one input field.

(7) A sequence homology search (BLAST) and pattern search options.

(8) SCOP/CATH/Pfam tree browsers.

ModBase
ModBase is a database of annotated comparative protein structure models, reported to contain models for more than 3.8 million unique protein sequences. Models are created by the comparative modeling pipeline ModPipe, which relies on the MODELLER program for fold assignment, sequence–structure alignment, model building and model assessment. ModBase is developed in the laboratory of Andrej Sali at UCSF. One published count lists 10,355,444 reliable models for domains in 2,421,920 unique protein sequences. ModBase allows users to update comparative models on demand and to request modeling of additional sequences through an interface to the ModWeb modeling server. ModBase models are available through the ModBase interface as well as the Protein Model Portal.

OCA
OCA is a browser-database for protein structure and function. It integrates information from KEGG, OMIM, PDBselect, Pfam, PubMed, SCOP, SwissProt, and others. It is a powerful alternative mechanism for searching the structures held in the Protein Data Bank. OCA provides rich content annotation on structure and function, generating dynamic links to these external sources. The database offers a simple search, a FASTA search, and many options for additional searches. It also allows the user to save the generated search results.

PDBsum
PDBsum is a database that provides an overview of the contents of each 3D macromolecular structure deposited in the Protein Data Bank. The original version of the database was developed around 1995 by Roman Laskowski and collaborators at University College London. As of 2014, PDBsum is maintained by Laskowski and collaborators in the laboratory of Janet Thornton at the European Bioinformatics Institute (EBI).

It includes images of the structure, annotated plots of each protein chain’s secondary structure, detailed structural analyses generated by the PROMOTIF program, summary PROCHECK results and schematic diagrams of protein–ligand and protein–DNA interactions. RasMol scripts highlight key aspects of the structure, such as the protein’s domains, PROSITE patterns and protein–ligand interactions, for interactive viewing in 3D. Numerous links take the user to related sites. PDBsum is updated whenever new structures are released by the PDB and is freely accessible.

PDBTM
The Protein Data Bank of Transmembrane Proteins (PDBTM) is a comprehensive and up-to-date selection of the transmembrane proteins in the Protein Data Bank (PDB). The PDBTM database is maintained at the Institute of Enzymology by the Membrane Protein Bioinformatics Research Group. It was created by scanning all PDB entries with the TMDET algorithm, which is able to distinguish between transmembrane and globular proteins using their 3D atomic coordinates only. The TMDET algorithm can also locate the spatial position of transmembrane proteins in the lipid bilayer. Since the database's release in 2004, numerous exotic transmembrane protein structures have been solved, and the number of database entries has increased from 400 to 17000.

ProtCID
The Protein Common Interface Database (ProtCID) is a database of similar protein-protein interfaces in crystal structures of homologous proteins.

Its main goal is to identify and cluster homodimeric and heterodimeric interfaces observed in multiple crystal forms of homologous proteins. Such interfaces, especially of non-identical proteins or protein complexes, have been associated with biologically relevant interactions.

A common interface in ProtCID indicates chain-chain interactions that occur in different crystal forms. All protein sequences of known structure in the Protein Data Bank (PDB) are assigned a "Pfam chain architecture", which denotes the ordered Pfam assignments for that sequence, e.g. (Pkinase) or (Cyclin_N)_(Cyclin_C). Homodimeric interfaces in all crystals that contain a particular architecture are compared, regardless of whether there are other protein types in the crystals. All interfaces between two different Pfam architectures in all PDB entries that contain them are also compared, e.g. (Pkinase) and (Cyclin_N)_(Cyclin_C). For both homodimers and heterodimers, the interfaces are clustered into common interfaces based on a similarity score.
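The architecture notation can be illustrated with a short Python sketch that builds the architecture string from ordered Pfam assignments and groups chains that share it. The entry names, chain identifiers, and residue start positions are hypothetical, and the sketch covers only the grouping step, not the interface comparison or the similarity score.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def chain_architecture(pfam_hits: List[Tuple[int, str]]) -> str:
    """Join Pfam assignments, ordered by start residue, into an architecture string."""
    ordered = sorted(pfam_hits)  # tuples sort by start residue first
    return "_".join(f"({name})" for _, name in ordered)


# Hypothetical chains: (entry, chain) -> list of (start residue, Pfam family).
chains = {
    ("entryA", "A"): [(4, "Pkinase")],
    ("entryA", "B"): [(210, "Cyclin_C"), (45, "Cyclin_N")],
    ("entryB", "A"): [(10, "Pkinase")],
}

# Chains with the same architecture are candidates for interface comparison.
by_architecture: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
for chain_id, hits in chains.items():
    by_architecture[chain_architecture(hits)].append(chain_id)

for architecture, members in by_architecture.items():
    print(architecture, members)
```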

ProtCID reports the number of crystal forms that contain a common interface, the number of PDB entries, the number of PDB and PISA biological assembly annotations that contain the same interface, the average surface area, and the minimum sequence identity of proteins that contain the interface. ProtCID provides an independent check on publicly available annotations of biological interactions for PDB entries.

Protein
The Protein database of the National Center for Biotechnology Information (NCBI), part of the NIH, is a collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq and Third Party Annotation, as well as records from SwissProt, PIR, PRF, and PDB.

Proteopedia
Proteopedia is a 3D encyclopedia of proteins and other molecules. The site contains a page for every entry in the Protein Data Bank (>130,000 pages), as well as more descriptive pages on protein structures in general, such as acetylcholinesterase, hemoglobin, and photosystem II, with a Jmol view that highlights functional sites and ligands. Currently, Proteopedia has 148,468 articles. It employs a scene-authoring tool so that users do not have to learn the JSmol script language to create customized molecular scenes.

123D+
123D+ threads a sequence through a set of 3D structures. It combines sequence profiles, secondary structure prediction, and contact capacity potentials to find the most compatible fold among the 3D structures, and the best alignment of the sequence with that fold.

Columba-DB: Protein Structure Annotation
This meta-server provides an integrated summary with links to details for information from the PDB, KEGG, ENZYME, ExPASy, DSSP, CATH, SCOP, SwissProt, NCBI Taxonomy, GO, and PISCES. The database can be searched using either a keyword search or data source-specific web forms. Users can thus quickly select and download PDB entries that are classified as containing a certain CATH architecture, are annotated as having a certain molecular function in the Gene Ontology, and have a resolution under a defined threshold. The results of queries are provided in both machine-readable extensible markup language (XML) and human-readable format. The structures themselves can be viewed interactively on the web.
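The kind of combined query described above (CATH architecture, Gene Ontology function, and resolution threshold together) can be sketched in Python over a small in-memory list. The `Entry` record, its field names, and the placeholder entries are assumptions for illustration only; they are not Columba's schema or interface.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Entry:
    """Hypothetical annotated structure record; not Columba's actual data model."""
    pdb_id: str
    cath_architecture: str
    go_functions: FrozenSet[str]
    resolution: float  # angstroms


def select_entries(entries: List[Entry], cath_architecture: str,
                   go_function: str, max_resolution: float) -> List[str]:
    """Return IDs of entries matching all three criteria at once."""
    return [
        e.pdb_id
        for e in entries
        if e.cath_architecture == cath_architecture
        and go_function in e.go_functions
        and e.resolution < max_resolution
    ]


# Placeholder entries with invented identifiers and annotations.
entries = [
    Entry("xxx1", "Alpha-Beta Barrel", frozenset({"hydrolase activity"}), 1.8),
    Entry("xxx2", "Alpha-Beta Barrel", frozenset({"hydrolase activity"}), 3.2),
    Entry("xxx3", "Sandwich", frozenset({"hydrolase activity"}), 1.5),
    Entry("xxx4", "Alpha-Beta Barrel", frozenset({"kinase activity"}), 1.9),
]

selected = select_entries(entries, "Alpha-Beta Barrel", "hydrolase activity", 2.5)
print(selected)
```

Only the first placeholder entry satisfies all three conditions; the others each fail exactly one of them.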

CASTp
CASTp is a server that identifies pockets and cavities in proteins, and quantitates their volumes. Atoms lining each pocket or cavity can be displayed in Chime, RasMol, or MAGE. CASTp can be used to study surface features and functional regions of proteins. It includes a graphical interface, flexible interactive visualization, as well as on-the-fly calculation for user uploaded structures.

Conformational Epitope Prediction Server
The Conformational Epitope Prediction (CEP) server predicts possible antigenic epitopes on the surfaces of submitted protein antigen structures and displays the predicted epitopes in Jmol. It provides a web interface to a conformational epitope prediction algorithm developed in-house by its authors. Apart from conformational epitopes, the algorithm also predicts antigenic determinants and sequential epitopes. The epitopes are predicted using 3D structure data of protein antigens and can be visualized graphically. The algorithm employs a structure-based bioinformatics approach and makes explicit use of the solvent accessibility of amino acids.

DisEMBL
DisEMBL is a computational tool for prediction of disordered/unstructured regions within a protein sequence. As no clear definition of disorder exists, its authors developed parameters based on several alternative definitions and introduced a new one based on the concept of "hot loops", i.e. coils with high temperature factors. Avoiding potentially disordered segments in protein expression constructs can increase expression, foldability and stability of the expressed protein. DisEMBL is thus useful for target selection and the design of constructs as needed for many biochemical studies, particularly structural biology and structural genomics projects.
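The "hot loops" definition (coils with high temperature factors) can be illustrated on a structure whose secondary structure and B-factors are already known. The Python sketch below only labels residues according to that definition; it is not DisEMBL's sequence-based predictor. Treating every state other than helix (H) and strand (E) as coil, the cutoff value, and the example data are all assumptions made for this sketch.

```python
from typing import List, Sequence


def hot_loop_mask(sec_struct: str, b_factors: Sequence[float],
                  b_cutoff: float) -> List[bool]:
    """Flag residues that are coil (not 'H' or 'E') and have a B-factor above b_cutoff."""
    if len(sec_struct) != len(b_factors):
        raise ValueError("need one secondary structure state and one B-factor per residue")
    return [ss not in "HE" and b > b_cutoff
            for ss, b in zip(sec_struct, b_factors)]


# Hypothetical nine-residue chain: per-residue state and C-alpha B-factor.
sec_struct = "CHHHCCEEC"
b_factors = [55.0, 18.0, 16.0, 60.0, 22.0, 48.0, 15.0, 14.0, 51.0]

mask = hot_loop_mask(sec_struct, b_factors, b_cutoff=40.0)
hot_positions = [i + 1 for i, flagged in enumerate(mask) if flagged]  # 1-based residue numbers
print(hot_positions)
```

Note that residue 4 has the highest B-factor in the example but is not flagged, because it sits in a helix; the definition requires both conditions.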

DisProt
"Database of Protein Disorder (DisProt) is a curated database that provides information about proteins that lack fixed 3D structure in their putatively native states, either in their entirety or in part." It is a database of experimental evidence of disorder, manually collected from the literature. Each piece of evidence is identified by one experiment, the corresponding paper, and the position in the sequence. When multiple experiments are reported in a single paper, DisProt records each of them separately, even if the experiments concern the same region.

FSSP
FSSP (families of structurally similar proteins) is a database of structural alignments of proteins in the Protein Data Bank (PDB). The database currently contains an extended structural family for each of 330 representative protein chains. Each data set contains structural alignments of one search structure with all other proteins in the representative set that show significant structural similarity (remote homologs, < 30% sequence identity), as well as all structures in the Protein Data Bank with 30-70% sequence identity relative to the search structure (medium homologs). Very close homologs (above 70% sequence identity) are excluded, as they rarely have marked structural differences. The alignments of remote homologs are the result of pairwise all-against-all structural comparisons in the set of 330 representative protein chains. All such comparisons are based purely on the 3D coordinates of the proteins and are derived by automatic (objective) structure comparison programs. The significance of structural similarity is estimated based on statistical criteria.
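The sequence identity bands used to sort homologs can be written down as a small Python function. The thresholds follow the description above; the function name and the handling of the exact boundary values (30% and 70% counted as medium) are assumptions of this sketch.

```python
def homolog_class(percent_identity: float) -> str:
    """Classify a structural neighbour by its sequence identity to the search structure."""
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("sequence identity must be a percentage between 0 and 100")
    if percent_identity < 30.0:
        return "remote"
    if percent_identity <= 70.0:
        return "medium"
    return "very close (excluded)"


for identity in (12.5, 30.0, 55.0, 70.0, 92.0):
    print(identity, homolog_class(identity))
```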

SCOP and SCOP2
The Structural Classification of Proteins (SCOP) database provides a detailed and comprehensive description of the relationships of known protein structures. The classification is on hierarchical levels: the first two levels, family and superfamily, describe near and distant evolutionary relationships; the third, fold, describes geometrical relationships. The distinction between evolutionary relationships and those that arise from the physics and chemistry of proteins is a feature that is unique to this database so far.

SCOP2 is a successor of SCOP. Similarly to SCOP, the main focus of SCOP2 is on proteins that are structurally characterized and deposited in the PDB. Proteins are organized according to their structural and evolutionary relationships, but, in contrast to SCOP, instead of a simple tree-like hierarchy these relationships form a complex network of nodes. Each node represents a relationship of a particular type and is exemplified by a region of protein structure and sequence.

SCOPe is a database developed at the Berkeley Lab and UC Berkeley that extends SCOP (version 1). SCOPe classifies many structures released since SCOP 1.75 through a combination of automation and manual curation, and corrects some errors, aiming to have the same accuracy as the fully hand-curated SCOP releases. SCOPe also incorporates and updates the Astral database.

ASTRAL
The ASTRAL compendium provides databases and tools useful for analyzing protein structures and their sequences. It is partially derived from, and augments, the Structural Classification of Proteins (SCOP) database. Most of the resources it provides depend upon the coordinate files maintained and distributed by the Protein Data Bank.

CAPRI
Critical Assessment of PRedicted Interactions (CAPRI) is a community-wide experiment designed to assess the capacity of structure-based protein-protein docking methods to predict protein-protein interactions. Its targets are unpublished crystal or NMR structures of complexes, communicated on a confidential basis by their authors to the CAPRI management. Participating predictor groups are given the atomic coordinates of two proteins that make biologically relevant interactions.

They model the target complex with the help of the coordinates and other publicly available data (sequence, mutations, etc.) and submit sets of ten models for assessment on the CAPRI website. After the prediction round is completed, the CAPRI assessors compare the submissions to the experimental structure and evaluate the models on criteria that depend on the geometry and biological relevance of the predicted interactions.

Comparative Modeling (Homology Modeling) Servers
Comparative modeling servers are continuously and automatically evaluated by EVA. For straightforward cases, comparative modeling is automated by SWISS-MODEL. For difficult cases, there is also a structure prediction meta-server, the BioInfoBank Meta Server.

SWISS-MODEL
SWISS-MODEL is a fully automated protein structure homology-modelling server, accessible via the ExPASy web server or from the program DeepView (Swiss Pdb-Viewer). SWISS-MODEL consists of three tightly integrated components:

(1) The SWISS-MODEL pipeline: a suite of software tools and databases for automated protein structure modelling.

(2) The SWISS-MODEL Workspace: a web-based graphical user workbench.

(3) The SWISS-MODEL Repository: a continuously updated database of homology models for a set of model organism proteomes of high biomedical interest.

BioInfoBank Meta Server
The BioInfoBank Meta Server offers a gateway to well-benchmarked protein structure and function prediction methods. Structural models collected from the prediction servers are assessed using the powerful 3D-jury consensus approach.

ConSurf
The ConSurf server (Glaser et al., 2003; Landau et al., 2005; Ashkenazy et al., 2010; Celniker et al., 2013; Ashkenazy et al., 2016) is a bioinformatics tool for estimating the evolutionary conservation of amino/nucleic acid positions in a protein/DNA/RNA molecule based on the phylogenetic relations between homologous sequences. The degree to which an amino (or nucleic) acid position is evolutionarily conserved (i.e., its evolutionary rate) is strongly dependent on its structural and functional importance. Thus, conservation analysis of positions among members of the same family can often reveal the importance of each position for the structure or function of the protein (or nucleic acid). In ConSurf, the evolutionary rate is estimated based on the evolutionary relatedness between the protein (DNA/RNA) and its homologues, taking into account the similarity between amino (nucleic) acids as reflected in the substitution matrix (Pupko et al., 2002; Mayrose et al., 2004).
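To give a feel for per-position conservation analysis, the Python sketch below scores each column of a small alignment with Shannon entropy. This is a deliberately simpler stand-in and not ConSurf's method: it ignores the phylogenetic tree and the substitution matrix that ConSurf's rate estimation relies on. The toy alignment and function names are assumptions for illustration.

```python
import math
from collections import Counter
from typing import List


def column_entropy(column: str) -> float:
    """Shannon entropy (bits) of the residues in one alignment column, ignoring gaps."""
    residues = [c for c in column if c != "-"]
    if not residues:
        raise ValueError("column contains only gaps")
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def alignment_entropies(alignment: List[str]) -> List[float]:
    """Entropy per column; low values indicate conserved positions."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [column_entropy("".join(col)) for col in zip(*alignment)]


# Toy alignment of four hypothetical homologous sequences.
alignment = ["MKVL", "MRVL", "MKIL", "MQVL"]
entropies = alignment_entropies(alignment)
for position, value in enumerate(entropies, start=1):
    print(position, round(value, 3))
```

Columns 1 and 4 are invariant and score zero, while column 2 is the most variable. A tree-aware rate estimate can rank positions differently, because it accounts for how closely related the sequences carrying each substitution are.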

Dali
The Dali server is a network service for comparing protein structures in 3D. One submits the coordinates of a query protein structure and Dali compares them against those in the Protein Data Bank (PDB). In favourable cases, comparing 3D structures may reveal biologically interesting similarities that are not detectable by comparing sequences. Users can perform four types of structure comparison:

(1) Heuristic PDB search - compares one query structure against those in the PDB.

(2) Exhaustive PDB25 search - compares one query structure against a representative subset of the Protein Data Bank.

(3) Pairwise structure comparison - compares one query structure against those specified by the user.

(4) All against all structure comparison - returns a structural similarity dendrogram for a set of structures specified by the user.

Dhruv Ch (talk) 12:44, 4 January 2018 (UTC)