User:Kinkreet/Protein Science/Bioinformatics

Sequence alignment
There are various types of sequence alignment: global and local, and pairwise and multiple.

Global
Global alignment compares the whole length of the query sequence with the whole length of each sequence in the database. It is the simplest alignment and is useful for looking at proteins which we know are homologous, such as the same protein from different species; this allows us to find differences between the species. In drug design, this may allow us to identify a tissue-specific target for the drug.

Only sequences which are adequately similar should be placed into the alignment; in a multiple sequence alignment, dissimilar sequences may distort the alignment of the whole set.

Local
Local alignment aligns only those sub-sequences within the database which match part of the query sequence. This means only part of the sequences needs to be homologous, so sequences containing substantial unrelated (unaligned) subsequences can still show a hit. This is useful for looking at, or identifying, proteins with similar domains, or a conserved motif/sequence of nucleotides.
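The idea of scoring only the best-matching sub-sequences can be sketched with a small Smith-Waterman style scorer. This is a minimal illustration, not the code of any particular tool: the function name and the match, mismatch and gap values are assumptions chosen for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    # H[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            # The floor of 0 lets an alignment restart anywhere,
            # which is what makes the alignment local rather than global.
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best


# The shared "ACGT" region scores 4 matches x 2, despite unrelated flanks.
print(smith_waterman_score("TTTACGTTT", "GGACGTGG"))  # 8
```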

cDNA of known proteins can be BLASTed against the genome to find matches; this is the basis of gene annotation. The cDNA will align with the exons of the genome, with gaps where the introns are.

Pairwise Sequence Alignment
Pairwise sequence alignment takes a query sequence and compares it individually to another sequence, or to each sequence in a whole database, to identify sequences which are highly similar.

Multiple Sequence Alignment
Multiple Sequence Alignment (MSA) takes all the sequences in the query and tries to align them all. It will introduce gaps where appropriate to help the overall alignment. However, gaps incur penalties, and so it might be better to take a sequence out of the query if it does not align well with the rest.

BLAST
BLAST (Basic Local Alignment Search Tool) is a pairwise, local alignment program which works with nucleotide and peptide sequences and also calculates the significance of the hits. It is a faster, heuristic approximation of the Smith-Waterman algorithm. It compares a query sequence against individual entries in a database to look for similarities, and returns any sequences which the program deems significant. BLAST can also be downloaded to run queries against your own database; this is useful when the genome of the species you are studying is not on NCBI.

There are many types of BLAST searches, each comparing sequences of one form to another.

Nucleotide BLAST (blastn) and protein BLAST (blastp) are the fastest BLAST searches. tblastn is slower because the nucleotide sequences need to be translated into protein sequences in all six reading frames (three per strand, times two because of the two strands of DNA). tblastx is even slower, as it needs to compare 6x6 reading-frame combinations, although tblastx is the most sensitive.
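To make the six reading frames concrete, the sketch below translates a DNA string in three offsets on each strand. It assumes the standard genetic code and upper-case A/C/G/T input; the function names are illustrative and are not part of BLAST.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; a trailing partial codon is dropped."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Three offsets on the given strand, then three on the reverse strand."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


print(six_frame_translations("ATGGCCTAA"))
# ['MA*', 'WP', 'GL', 'LGH', '*A', 'RP']
```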

The results page shows only the regions of each sequence which align with the query; other regions are not shown. So when interpreting the results, one must consider whether the hit is homologous at the whole-protein level, or just at that region/domain.

Scoring/Substitution Matrices
Substitution matrices take proteins that we know are the same and compare their sequences to identify differences, more specifically looking at residues which are important and cannot be changed/substituted, along with residues which are important but can be changed. The latter will give us an idea of how similar two amino acids are.

From many comparisons, we can build up a scoring table (or matrix) which details the likelihood of one residue being substituted for another. This probability is combined with the abundance of the amino acid and represented numerically as a score, with a positive score meaning the substitution can occur without changing the structure and thus function of the protein, and a negative score meaning this is unlikely. For example, according to the BLOSUM62 matrix, a glutamate to glutamine substitution has a score of 2, meaning it is not unlikely; in contrast, a tryptophan to aspartate substitution has a score of -4, and is thus highly unlikely. Amino acids substituting for themselves always have a positive score, with rare amino acids such as tryptophan and cysteine given higher scores.
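As a small worked example, the sketch below scores an ungapped alignment by adding up per-position substitution scores. Only a handful of BLOSUM62 entries are included, just enough for the toy sequences; the dictionary and function names are assumptions made for illustration.

```python
# A few BLOSUM62 entries; the full matrix covers every pair of residues.
BLOSUM62_SUBSET = {
    ("E", "E"): 5, ("Q", "Q"): 5, ("D", "D"): 6,
    ("C", "C"): 9, ("W", "W"): 11,
    ("E", "Q"): 2, ("W", "D"): -4,
}


def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]  # KeyError if the pair is not in the subset


def ungapped_score(s1, s2):
    """Sum the substitution scores of two equal-length aligned sequences."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(s1, s2))


print(ungapped_score("ECW", "QCD"))  # 2 + 9 - 4 = 7
print(ungapped_score("ECW", "ECW"))  # 5 + 9 + 11 = 25
```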

Examples
PAM (Point Accepted Mutation) is one of the earliest sets of amino acid substitution matrices, developed by Margaret Dayhoff in 1978 using multiple sequence alignment based on 1572 observed mutations in 71 families of closely related proteins. The number associated with the matrix corresponds to the expected number of point mutations (per 100 residues) to have occurred between the query and the database. Thus a higher number implies more tolerance to sequences which are separated by a larger evolutionary distance.

BLOSUM (BLOcks SUbstitution Matrix) is used for more divergent sequences, as it is generated from the multiple alignment of evolutionarily divergent proteins. It looks at blocks where there are good alignments to see which regions are conserved. A conserved block is assumed to have an important function, and the amino acids within it are assumed to be significant. Thus, the BLOSUM matrix bases its scoring on these conserved blocks of sequence, rather than on the whole sequence.

BLOSUM62 is the default matrix used by protein BLAST.

Continued
During an alignment, the total score of the alignment is calculated by adding the individual scores given by the substitution matrix. In principle, every possible alignment would be tried and those with the highest scores returned. This is computationally demanding and time-consuming, so the BLAST program takes a few shortcuts to speed up the process; in some cases, this might mean some homologous sequences are missed.

Apart from sequence identity/similarity, the BLAST program also takes gaps into account. A gap means there is an insertion or deletion in one of the sequences (possibly both). Gaps are heavily penalized in the alignment score, and the penalty increases with the size of the gap.

In the results page, identity means the percentage of the residues which were identical; positive means the percentage of residues which were identical plus conservative substitutions.

The E-value (expect value) is the most important figure when interpreting results; it is the number of hits with an equal or better score that would be expected to appear by chance, given the length of the query and the size of the database (the E-value will tend to be better with longer queries and smaller databases; however, the database should be as large as possible to ensure comprehensiveness). It can be viewed as the background noise, and it decreases exponentially as the score increases.

The closer the E-value is to 0, the more significant the hit. The E-value threshold can be raised to allow the results page to display sequences of lower similarity.
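How the E-value depends on score and search space can be illustrated with the commonly quoted relation E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the total length of the database. The function and parameter names below are made up for this sketch, and BLAST's own calculation applies further corrections that are ignored here.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E-value from a bit score and the size of the search space."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Each extra bit of score halves the E-value; doubling the database doubles it.
print(expect_value(40, 100, 1_000_000))
print(expect_value(41, 100, 1_000_000))
print(expect_value(40, 100, 2_000_000))
```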

Limitations
There are different evolutionary pressures on different parts of a protein, and on different protein families. For example, an essential domain is less tolerant to mutations than accessory domains; mutations in the active site are less tolerated than mutations in the transmembrane domain. BLOSUM is generic and assumes equal evolutionary pressures, and thus an equal rate of mutation. A more accurate scoring mechanism would be one which focuses on locations within the protein which are important, such as the active site. A PSSM (Position-Specific Scoring Matrix) takes this into account.

Patterns
BLAST is suitable for aligning sequences, but the sequences do not tell us anything about the function of the protein. Often structurally similar proteins share little sequence similarity. As structure determines the function of a protein, finding similarities in structure might be a more accurate homology search.

Each domain or structure usually has a characteristic pattern in its sequence; for example, α-helical bundles usually have a hydrophobic residue every three to four residues, so that the helix has a hydrophobic face which associates with other helices.

Therefore, instead of searching for sequence homology based on residue substitutions, a search can be made to find whether a pattern exists within a sequence. The pattern can be pre-defined, or it can be obtained from an MSA. From an MSA, we can identify regions within the sequence which are conserved; if we assume that these regions are of importance to the protein, then we can also assume that they form an important domain or structure within the protein. These regions can be isolated and examined for patterns, looking for which residues are absolutely conserved, and which are general (e.g. any hydrophobic or aromatic residue will do). From this a pattern is built.

A pattern is written using the following notation:
[] means any of the residues within the square brackets is equally likely
X means any residue is possible
() is a multiplier
{} means any residue except those within the braces

Patterns are suitable for small motifs with relatively high sequence similarity. Both patterns and profiles can be searched online at PROSITE, a database of protein domains, families and structural motifs.
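A pattern search can be sketched by translating a PROSITE-style pattern into a regular expression. The converter below is a minimal illustration that handles only the core elements (x, [..], {..}, repeat counts and the hyphen separator); the function name is an assumption, and the N-glycosylation pattern N-{P}-[ST]-{P} is used as the example.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as (2) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core.lower() == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        # [..] groups and single letters are already valid regex syntax.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)                                  # N[^P][ST][^P]
print(bool(re.search(regex, "AANGSAA")))      # True
print(bool(re.search(regex, "AANPSAA")))      # False
```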

Position Specific Scoring Matrices (PSSM) and PSI-BLAST
Amino acid substitution is not uniform due to different evolutionary pressures, as mentioned before. Different evolutionary pressures on different segments and residues mean that a general substitution matrix would not be accurate.

A PSSM is a scoring matrix which provides scores specific to the position of the residue. It is built by taking sequences which are known to be homologous and producing a multiple sequence alignment from them. At each position of the alignment, every residue is given a score based on the relative frequency at which it appears there; the higher the frequency (and thus probability), the higher the score. Also, the more conserved a residue, the lower the score if it is substituted. Because the scores are position-specific, the same side-chain substitution can have different scores at different positions.
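The construction can be sketched as follows. This is a simplified illustration under stated assumptions: a uniform background frequency for all 20 amino acids, a flat pseudocount so unseen residues do not score minus infinity, and log-odds scores in bits. Real tools weight sequences and use observed background frequencies; the function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(r for r in column if r in AMINO_ACIDS)  # gaps ignored
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_sequence(pssm, seq):
    """Sum the position-specific scores of an ungapped sequence."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


pssm = build_pssm(["ACD", "ACE", "ACD", "AGD"])
# Glycine scores differently at position 1 (never seen) and position 2 (seen once).
print(pssm[0]["G"], pssm[1]["G"])
print(score_sequence(pssm, "ACD") > score_sequence(pssm, "WWW"))  # True
```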

Using this profile (another name for a PSSM), we can search the sequences in a database to identify those which are homologous. A PSSM search differs from BLAST because its score is biased towards important residues and domains, and is therefore less affected by a background score from residues which can be more freely substituted. Thus, it might be able to find less obvious sequences missed by BLAST.

PSI-BLAST is a program which first runs a BLAST search with an initial sequence; from the significant hits it then builds a multiple sequence alignment, from which a profile is derived. In subsequent iterations it searches with this PSSM to find new sequences. Any significant hits are incorporated into the existing profile and the search is run again. The search is considered finished when there is convergence, that is, when no new sequences can be found in the database; this usually takes 3-4 iterations. PSI-BLAST is most effective if a broad range of homologous proteins from many species are used as the starting profile.

Hidden Markov Models (HMM)
We have seen a general substitution matrix, where the score is based on the amino acid regardless of position; we have then seen the PSSM, whose score is based on amino acid and position; we also looked at patterns, whose score is based on amino acids relative to each other.

The Hidden Markov Model (HMM) takes all these into account, plus a consideration for gaps, to give what is generally the most sensitive homology search. First, it takes homologous sequences and performs a multiple sequence alignment. It then builds a probability map (from the frequencies in the MSA) of the likelihood of each possible transition from one residue to the next, taking into account preceding and following amino acids. For example, from position 102, a serine, there is a 50% chance that it will be followed by a gap, a 20% chance that it will be followed by a hydrophobic residue, and a 30% chance that it will be followed by a deletion. Sequence similarity is calculated by traversing the query through the probability map from the beginning of the sequence to the end, multiplying the probabilities at each step. If the overall probability is above a threshold, the query can be deemed homologous.

HMMER is a program used to generate and/or search HMM databases.

Overview
BLAST is the fastest but the least sensitive; patterns and profiles are more sensitive; HMMs are the most sensitive and are able to find more distantly related sequences.

Protein Structure Prediction
Even pattern searches provide limited information on the structure of the protein; since structure determines function, directly predicting the structure of a protein and searching with that can be a more accurate way of searching.

Furthermore, structure can be conserved even when proteins share only about 25% sequence identity, and so sequence similarity is not an accurate measure of structural similarity.

Structures can be elucidated using experimental techniques, but these are slow and expensive, and often impractical. For example, X-ray crystallography requires pure protein and a clear crystal, NMR can only practically be used on small proteins, and cryo-electron microscopy often has lower resolution.

Bioinformatics by no means replaces experimental techniques, but it can provide a fast baseline from which to work.

Secondary structure prediction aims to identify the likely structure of each residue. It assumes that the local structure is determined only by the local sequence; it disregards long-range effects from residues which are distant in the primary sequence, even if they are close in space.

Secondary Structure prediction can be carried out using programs such as Jpred.

Typically, three secondary structure states are assumed: α-helices, β-sheets and coils; this is called Q3 prediction. The Q3 score measures accuracy as the percentage of residues whose state is correctly predicted. This measure of accuracy is crude but still widely used.
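The Q3 calculation is simple enough to sketch directly. The example assumes the predicted and observed structures are given as equal-length strings using the letters H (helix), E (sheet) and C (coil); both the encoding and the function name are choices made for this illustration.

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues whose three-state label is predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Seven of the eight residues agree.
print(q3_accuracy("HHHEECCC", "HHHEEECC"))  # 87.5
```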

Like PSSM, we can also build a profile of the predicted secondary structure alignment, and use that to identify new sequences. If this profile is built from proteins with known folds, then the accuracy of the search will be higher.

Both sequence and structural profiles can be searched together to give a more reliable result.

Phylogenetic Tree
A phylogenetic tree is a graphical representation of the evolutionary relationship between species or sequences, such as protein families. There are many variations in trees:

Scaled/Unscaled - On scaled trees, the length of the line represents the evolutionary distance (number of mutations) between the two species/sequences.

Rooted/Unrooted - Unrooted trees show only the relationship between species/sequences. The nodes on an unrooted tree do not represent a common ancestor, whereas nodes on a rooted tree do. An unrooted tree can be rooted if an outgroup is provided. An outgroup is a sequence/species that is closely related to the other groups in the tree, but less closely than any of the other groups are to each other. The outgroup roots the tree as it will have branched from the parent group before the other groups branched from each other. There are many ways of rooting a tree, and it is often difficult to pick the one which is correct.

Building
A tree requires multiple nodes, and each node is representative of a sequence; thus to build a tree, a multiple sequence alignment must first be produced. ClustalW (distance matrix/nearest neighbour), MAFFT, MUSCLE, T-Coffee, PAUP* (maximum parsimony, distance matrix, maximum likelihood), PHYLIP (maximum parsimony, distance matrix, maximum likelihood), TREE-PUZZLE (maximum likelihood) and MrBayes (Bayesian inference) are examples of alignment and tree-building programs.

ClustalW is the most common program used, and its output is represented by notation:
 * means the residue at this position is conserved in all the sequences
 : means the residues at this position all have roughly the same size and hydrophobicity
 . means the residues at this position all have roughly the same size or hydrophobicity

Algorithms
There are different algorithms used for tree building. Three main methods are used: Distance-based (UPGMA, neighbour joining); Character-based (maximum parsimony, maximum likelihood); and Bayesian.

Distance-based
Distance-based algorithms measure the number of changes between pairs of sequences. A matrix of distances between the sequences is made and used to draw the tree. This is a crude method and is often used in the first step of an MSA. It is often adequate if there is enough data.
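A distance matrix of this kind can be sketched using the simplest possible measure, the fraction of aligned positions at which two sequences differ (often called the p-distance). The choice of measure, the handling of gaps and the function names are assumptions for this example; real programs usually apply a correction for multiple substitutions.

```python
def p_distance(s1, s2):
    """Fraction of aligned, ungapped positions at which two sequences differ."""
    pairs = [(a, b) for a, b in zip(s1, s2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(aligned):
    """Symmetric matrix of pairwise p-distances for aligned sequences."""
    n = len(aligned)
    return [[p_distance(aligned[i], aligned[j]) for j in range(n)]
            for i in range(n)]


for row in distance_matrix(["ACGT", "ACGA", "TCGA"]):
    print(row)
# [0.0, 0.25, 0.5]
# [0.25, 0.0, 0.25]
# [0.5, 0.25, 0.0]
```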

UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is a fast distance-based algorithm. It is simplified by several assumptions. It assumes that the evolutionary pressure on all the proteins in the tree is the same, and that the terminal nodes are equally distant from the root (i.e. the tree is ultrametric, which follows from the first assumption). UPGMA gives a rooted and unscaled tree.
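The clustering procedure behind UPGMA can be sketched in a few lines: repeatedly merge the two closest clusters and replace their distances with a size-weighted average. The representation of the tree as nested tuples and the function name are assumptions of this example; the merge height is returned alongside the tree purely for illustration.

```python
def upgma(names, dist):
    """Cluster leaves with UPGMA; return (nested-tuple tree, root height)."""
    # Each active cluster: id -> (subtree, number of leaves, height)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    # Distances between active clusters, keyed by the pair of cluster ids.
    d = {frozenset((i, j)): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        pair = min(d, key=d.get)            # closest pair of clusters
        i, j = sorted(pair)
        (ti, ni, _), (tj, nj, _) = clusters.pop(i), clusters.pop(j)
        height = d.pop(pair) / 2            # both children sit at equal depth
        for k in clusters:
            dik = d.pop(frozenset((i, k)))
            djk = d.pop(frozenset((j, k)))
            # Average distance, weighted by the number of leaves in each cluster.
            d[frozenset((next_id, k))] = (ni * dik + nj * djk) / (ni + nj)
        clusters[next_id] = ((ti, tj), ni + nj, height)
        next_id += 1
    (tree, _, height), = clusters.values()
    return tree, height


tree, height = upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]])
print(tree, height)  # ('C', ('A', 'B')) 3.0
```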

Neighbour joining is another distance-based algorithm, which does not make the assumptions made in UPGMA. It is slower and can handle fewer sequences. Therefore, UPGMA should be used with a large number of sequences which we know have roughly similar evolutionary pressures, and neighbour joining can be used for a more accurate tree. Neighbour joining gives an unrooted and scaled tree.

Character-based
Character-based algorithms are more realistic than distance-based algorithms, but are also more computationally demanding.

Maximum parsimony is an algorithm which aims to minimize the distance between the sequences; the tree with the smallest overall distance is assumed to be the best tree. It calculates distance by counting the residues which are mutated from one sequence to the other. However, as we know, different segments of a sequence are under different evolutionary pressures, and so maximum parsimony also takes this into account. It looks only at 'informative sites', defined as sites in an MSA where there are at least two different kinds of residues, each of which is represented in at least two of the sequences under study. These informative sites are likely to be important and so will give us a more sensitive measure of distance.
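The definition of an informative site translates directly into code. The sketch below scans the columns of an alignment and keeps those with at least two residue types that each occur in at least two sequences; gaps are ignored, which is an assumption of this example, as is the function name.

```python
from collections import Counter


def informative_sites(alignment):
    """Return the 0-based column indices of parsimony-informative sites."""
    sites = []
    for index, column in enumerate(zip(*alignment)):
        counts = Counter(r for r in column if r != "-")
        # Count how many residue types appear in two or more sequences.
        shared_types = sum(1 for c in counts.values() if c >= 2)
        if shared_types >= 2:
            sites.append(index)
    return sites


# Column 0 is invariant; column 3 has a residue seen only once.
print(informative_sites(["AAGA", "AAGT", "ACCT", "ACCT"]))  # [1, 2]
```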

A limitation of maximum parsimony is that it does not take prior knowledge into account. It is based on the number of residues mutated and the number of residues conserved, but any substitution is counted as one mutation, and this is not always the case.

Maximum likelihood takes into account evolutionary models (specific to species/kingdom/phylum) and prior knowledge. For example, a change from glutamate to phenylalanine can be seen as one mutation at the protein level, but if we look at the codons (species-specific) which encode glutamate and phenylalanine in that species, we might find that at least two mutations are required for that change. Thus instead of one change (as assumed by maximum parsimony), two changes are assumed. Maximum likelihood also does not assume that evolutionary pressure is the same for all sequences/species.

Thus maximum likelihood will give a more realistic tree, but both algorithms still require the sequences to be quite similar.

Because multiple trees are possible, and due to the nature of the algorithms, the same search can give different trees. A method called bootstrapping is used, where the tree building is repeated many times on resampled data and the tree (or branching) that comes out most consistently is picked.
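One common form of bootstrapping resamples the columns of the alignment with replacement, rebuilds a tree from each resampled alignment, and then checks how often each grouping recurs. The sketch below shows only the resampling step, under that assumption; the function name is hypothetical and the tree-building step is left to whichever method is in use.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns resampled with replacement."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    # The same column picks are applied to every sequence, so columns stay intact.
    return ["".join(seq[i] for i in picks) for seq in alignment]


rng = random.Random(0)  # seeded only so the example is repeatable
original = ["ACGT", "ACGA", "TCGA"]
replicates = [bootstrap_alignment(original, rng) for _ in range(100)]
print(replicates[0])
```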

Bayesian
Bayesian inference uses a method similar to maximum likelihood, but uses Markov chain Monte Carlo sampling algorithms.