Talk:Structural alignment

September 2020
Hello, I'm Wtmitchell. I wanted to let you know that I reverted one of your recent contributions —specifically this edit to Structural alignment—because it did not appear constructive. If you would like to experiment, please use the sandbox. If you have any questions, you can ask for assistance at the Help desk. Thanks. Wtmitchell (talk) (earlier Boracay Bill) 18:40, 21 September 2020 (UTC)
 * If this is a shared IP address, and you did not make the edits referred to above, consider creating an account for yourself or logging in with an existing account so that you can avoid further irrelevant notices.

Mammoth Discussion
Hello Wtmitchell. Let me introduce myself: I'm CEMS2, one of the original authors of the Rosetta structure prediction program. I have a long list of publications on how to compare protein structures and on how to use protein structure comparisons for many purposes. It's this last point that is subtle. There is an infinity of different structure overlap algorithms, and the last thing the planet needs is yet another. The reason we keep getting more and more of these is that people are not stopping to ask: why do you want that structure overlap anyhow?

It's the most important question. Is it local structure for function? Is it evolutionary? Are you comparing mutations of the same protein? Are you looking for remote homologs? Are you trying to figure out which residues should be paired in a new BLOSUM alignment matrix? When you ask that question, a few algorithms stand out from the crowd. Many are not here, but Mammoth, which I am adding, was really the very first one that actually asked that question.

It is contemporary (2002) with SSAP (1998) and predates TM. Notably, Mammoth and DALI were the primary methods by which structures have been compared at CASP: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5479680/ It is one of four methods offered by the PDB (RCSB now) among its structural alignment tools.

So this is one of the granddaddy algorithms. Its key niches: What if you are comparing decoy structures from structure prediction to real proteins? What about remote homologs, where most superposition algorithms fail? Mammoth was the first algorithm to recognize that what you care most about is the E-value, not the actual 3D overlap: how likely is this overlap by chance? So it adds a lot to the discussion.

Below is a section in development.

Mammoth
MAMMOTH approaches the alignment problem with a different objective from almost all other methods. Rather than trying to find an alignment that maximally overlaps the largest number of residues, it seeks to arrive at a well-defined probability that the 3D overlap of its structural alignment did not arise by random chance. To do this, it marks a local motif alignment with flags indicating which residues simultaneously satisfy several more stringent criteria: (1) local structure overlap, (2) regular secondary structure, (3) superposition in 3D, and (4) the same ordering in primary sequence. It uses the number of residues with high-confidence matches and the size of the protein to compute an expectation value (E-value) for obtaining the outcome by chance. It excels at matching remote homologs, particularly structures generated by ab initio structure prediction to structure families such as SCOP, because it emphasizes extracting a statistically reliable sub-alignment rather than achieving the optimal sequence alignment or maximal 3D overlap.

For every overlapping window of 7 consecutive residues, it computes the set of displacement-direction unit vectors between adjacent C-alpha atoms. Local motifs are compared all-against-all based on the URMS score. These values become the pair alignment score entries for dynamic programming, which produces a seed pairwise residue alignment. The second phase uses a modified MaxSub algorithm: a single aligned pair of 7-residue windows, one from each protein, is used to orient the two full-length structures so as to maximally superimpose just these 7 C-alpha atoms; then, in this orientation, it scans for any additional aligned pairs that are close in 3D. It re-orients the structures to superimpose this expanded set and iterates until no more pairs coincide in 3D. This process is restarted for every 7-residue window in the seed alignment. The output is the maximal number of atoms found from any of these initial seeds. This statistic is converted to a calibrated E-value for the similarity of the proteins.
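To make the first phase concrete, here is a minimal sketch of the window representation and a unit-vector RMS comparison, assuming NumPy is available. It is not MAMMOTH's own code: the function names `unit_vectors` and `urms` are hypothetical, the toy coordinates are made up, and finding the best rotation with an SVD is an assumption about one reasonable way to compute such a score.

```python
import numpy as np

def unit_vectors(ca_window):
    """Direction unit vectors between consecutive C-alpha positions.

    ca_window: (7, 3) array of C-alpha coordinates -> (6, 3) unit vectors.
    """
    steps = np.diff(np.asarray(ca_window, dtype=float), axis=0)
    return steps / np.linalg.norm(steps, axis=1, keepdims=True)

def urms(u, v):
    """RMS distance between two unit-vector sets after the best rotation of u onto v.

    No translation is needed: every unit vector starts at the origin.
    """
    h = u.T @ v                           # 3x3 covariance of the two sets
    a, _, bt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(a @ bt))    # guard against a reflection
    rot = a @ np.diag([1.0, 1.0, d]) @ bt
    diff = u @ rot - v
    return float(np.sqrt((diff ** 2).sum() / len(u)))

# Toy data (illustrative only): a helix-like window, a rigidly moved copy,
# and a zig-zag, strand-like window.
t = np.arange(7)
helix = np.column_stack([2.3 * np.cos(1.745 * t), 2.3 * np.sin(1.745 * t), 1.5 * t])
theta = 0.7
rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta), np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
moved = helix @ rz.T + np.array([5.0, -3.0, 2.0])
strand = np.column_stack([3.3 * t, 0.9 * (-1.0) ** t, np.zeros(7)])

same = urms(unit_vectors(helix), unit_vectors(moved))        # close to zero
different = urms(unit_vectors(helix), unit_vectors(strand))  # clearly larger
```

Because the score depends only on directions between neighbouring C-alpha atoms, a rigidly moved copy of a window scores near zero, while a window with a different local shape scores higher. In the scheme described above, scores of this kind would fill the matrix handed to dynamic programming.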

Mammoth makes no attempt to re-iterate the initial alignment or to extend the high-quality subset. Therefore the seed alignment it displays can't be fairly compared to that of DALI or TM-align, as it was formed simply as a heuristic to limit the search space. (It can be used if one wants an alignment based solely on local structure-motif similarity, agnostic of long-range rigid-body atomic alignment.) Because of that same parsimony, it is well over ten times faster than DALI, CE and TM-align. It is often used in conjunction with these slower tools to pre-screen large databases, extracting just the structures with the best E-values for a more exhaustive superposition method.

It has been particularly successful at analyzing "decoy" structures from ab initio structure prediction. These decoys are notorious for getting local fragment motif structure correct and forming some kernels of correct 3D tertiary structure, but getting the full-length tertiary structure wrong. In this twilight remote-homology regime, Mammoth's E-values for the CASP protein structure prediction evaluation have been shown to be significantly more correlated with human ranking than those of SSAP or DALI. Mammoth's ability to extract the multi-criteria partial overlaps with proteins of known structure and rank them with proper E-values, combined with its speed, facilitates scanning vast numbers of decoy models against the PDB database to identify the most likely correct decoys based on their remote homology to known proteins.

From PNA/Biology

 * Structural alignment is vague; needs more info; should be more specifically named (perhaps Protein structural alignment --Stewart Adcock) +sj+ 09:52, 2004 Feb 22 (UTC)

This entry is not only bad, but probably full of mistakes. Please feel free to voice your opinion, create new topics where appropriate and generally change at will the contents of this page.

The correct reference to cite for SARF2: Alexandrov, N.N. SARFing the PDB. Protein Engineering (1996), 9:727-732. There is a mistake in the authorship of the SARF2 program.


 * You aren't wrong about it being full of mistakes!
 * Some approaches use quaternions to reduce the dimensionality of the space -- Actually, quaternions increase the dimensionality of rotation space.
 * The Dali algorithm (named after Salvador Dali) uses network isomorphism between the contact networks of two proteins to perform alignment. -- DALI is an abbreviation of Distance matrix Alignment.
 * Stewart Adcock 23:21, 20 Feb 2004 (UTC)
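A small illustration of the quaternion point, as a sketch only (the helper name `quat_to_matrix` is hypothetical). A rotation in 3D has three degrees of freedom, while a unit quaternion stores four numbers subject to a unit-length constraint, and both q and -q map to the same rotation. So the quaternion form embeds rotations in a larger space rather than reducing dimensionality.

```python
import math

def quat_to_matrix(q):
    """Rotation matrix (3x3 nested lists) for a unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

# A 90 degree rotation about the z axis: four numbers, constrained to unit length.
half = math.radians(90) / 2
q = (math.cos(half), 0.0, 0.0, math.sin(half))
neg_q = tuple(-c for c in q)

m_pos = quat_to_matrix(q)
m_neg = quat_to_matrix(neg_q)  # q and -q describe the same rotation
```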

I think you should be brave and fix this page!


 * Don't worry, I definitely will -- when I have time. I even have a report that I could wikify. Stewart Adcock 17:15, 16 Mar 2004 (UTC)


 * Cool! --Dan 15:34, 1 Apr 2004 (UTC)

Did a complete rework, hope it is a bit clearer... One could expand on the algorithmic side of things, as well as on the use of the method. Will try, as soon as I find time. Dr. Strangelove 20:00, 19 Jul 2004 (UTC)

---

The following comments were made by me, Andrew Dalke, on the Biopython mailing list. I don't like learning a bajillion wiki formatting conventions so I'm leaving this as-is on the talk page.

If it's two conformations of the same structure and the goal is to minimize overall RMSD through a single global alignment matrix then the usual reference I know is Kabsch

[Kabsch, 1976] Kabsch, W. (1976). A solution for the best rotation to relate two sets of vectors. Acta Cryst. A32:922-923.

[Kabsch, 1978] Kabsch, W. (1978). A discussion of the solution for the best rotation to relate two sets of vectors. Acta Cryst. A34:827-828.

(The first had an ambiguity that could cause a sign error; fixed in the second.)
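For readers who want to see the shape of this calculation, here is a minimal sketch of least-squares superposition of two paired coordinate sets, assuming NumPy. It uses the SVD formulation that is common in modern code rather than the exact procedure of the papers above, and the function name `kabsch_rmsd` and the random test coordinates are illustrative assumptions.

```python
import numpy as np

def kabsch_rmsd(p, q):
    """Minimum RMSD between paired point sets p and q, both (n, 3), over rigid motions.

    Returns (rmsd, rotation, translation) with p @ rotation + translation ~ q.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    pc, qc = p.mean(axis=0), q.mean(axis=0)   # centroids remove the translation
    h = (p - pc).T @ (q - qc)                 # 3x3 cross-covariance matrix
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(u @ vt))        # keep a proper rotation, not a reflection
    rotation = u @ np.diag([1.0, 1.0, d]) @ vt
    translation = qc - pc @ rotation
    diff = p @ rotation + translation - q
    return float(np.sqrt((diff ** 2).sum() / len(p))), rotation, translation

# Illustrative data: random points and a rigidly moved copy.
rng = np.random.default_rng(0)
a = rng.normal(size=(10, 3))
angle = 1.1
rx = np.array([[1.0, 0.0, 0.0],
               [0.0, np.cos(angle), -np.sin(angle)],
               [0.0, np.sin(angle), np.cos(angle)]])
b = a @ rx.T + np.array([1.0, 2.0, 3.0])

rmsd, rot, trans = kabsch_rmsd(a, b)          # rmsd is close to zero here
mirror_rmsd, mirror_rot, _ = kabsch_rmsd(a, a * np.array([1.0, 1.0, -1.0]))
```

The residue pairing is taken as given here, which is why this runs in time linear in the number of atoms plus a fixed-size 3x3 decomposition. The mirror-image call shows the determinant guard at work: the returned matrix stays a proper rotation, so a mirrored copy does not superimpose perfectly.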

Several structure programs implement it, including:
 * O - http://xray.bmc.uu.se/usf/factory_4.html
 * VMD - http://www.ks.uiuc.edu/Research/vmd/vmd-1.7.1/ug/node183.html
 * PyMol - http://www.pymolwiki.org/index.php/Kabsch

This algorithm is not mentioned on the Wiki page and is 11 years older than the oldest mentioned reference. It isn't NP-hard as alluded to in the Wiki, since it's solving a much simpler problem.

The date for ProFit is strange on the Wiki. I knew about ProFit before the claimed 1996. The web site says http://www.bioinf.org.uk/software/profit/faq.html

No paper has been published describing ProFit itself since it is simply a convenient program (I hope) to let you use a standard fitting algorithm consequently, it is a little difficult to reference. The exact wording is up to you and dependent on the context, but I suggest something similar to:

Fitting was performed using the McLachlan algorithm (McLachlan, A.D., 1982 "Rapid Comparison of Protein Structures", Acta Cryst A38, 871-873) as implemented in the program ProFit (Martin, A.C.R., http://www.bioinf.org.uk/software/profit/).

The McLachlan algorithm is also not mentioned in the Wiki.

I found a usenet announcement about ProFit at http://groups.google.com/group/bionet.software/msg/f219c5163bbadbdc?dmode=source&hl=en

From: mar...@bsm.bioc.ucl.ac.uk (Andrew Martin)
Subject: ANNOUNCE: Protein Fitting Software
Date: 1995/07/20
Message-ID: <1995Jul20.101545.37119@ucl.ac.uk>#1/1
X-Deja-AN: 106640715
organization: University College London
newsgroups: bionet.software

I've finally (after about 2 years) got around to fixing the last (?) remaining bugs in my ProFit protein least-squares fitting software and decided to make it generally available.

So the date should be 1995, or 1993 or earlier. I suspect the 1996 date came from the "first written" date for the documentation, at http://www.bioinf.org.uk/software/profit/doc/

What happened to the image Str-alignment.png?

According to the log, the link was removed because the image was deleted from Wikimedia.

Is there any way to link it back?

Software List
The list of available programs is gone... I think it was a good idea. —Preceding unsigned comment added by 146.203.21.23 (talk • contribs) 12:16, 3 July 2006
 * It was moved to Sequence alignment software, currently linked in the See Also section. It'll get a more prominent link when the list is in better shape. If you have any new additions, or a perspective on which packages are the most commonly used, that would be great! Opabinia regalis 00:11, 4 July 2006 (UTC)
 * 'Sequence alignment software' perhaps is not the most accurate place for structure based alignment programs. —Preceding unsigned comment added by 146.203.21.237 (talk • contribs) 14:46, 7 July 2006
 * Structural alignment software now redirects there. I think the advantages of having everything in one place outweigh the minor disadvantage of following a redirect, but I'll put the question on the sequence alignment peer review if anyone else wants to chime in. Opabinia regalis 01:15, 8 July 2006 (UTC)

GA review

 * 1. Well written? Pass, upon changes below.
 * 2. Factually accurate? Pass
 * 3. Broad in coverage? Pass
 * 4. Neutral point of view? Pass
 * 5. Article stability? Pass
 * 6. Images? Pass
 * 7. Appropriate referencing? Pass, upon changes below.

Additional comments: Some changes needed.
 * However, at O(n^10 / ε^6) for a globular protein of n residues, the algorithm is still too expensive for practical use.
 * I don't think the algorithm itself is expensive, don't you mean expensive to implement in terms of CPU time?


 * Representation of structures
 * No references in this section, could add one to cover the statement about improving data by discarding noise.


 * This representation is expensive because the features in the square matrix are symmetrical...
 * Again, expensive isn't what you mean. TimVickers 14:06, 30 September 2006 (UTC)
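To show what the memory cost of a symmetrical square matrix amounts to, here is a rough sketch. It assumes the representation in question is an all-against-all distance matrix, and the helper names `condensed_distances` and `lookup` are hypothetical. Storing only the pairs with i < j keeps n(n-1)/2 values instead of n^2.

```python
import math
from itertools import combinations

def condensed_distances(coords):
    """Pairwise distances for i < j only, in row-major upper-triangle order."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]

def lookup(condensed, n, i, j):
    """Distance between points i and j, read from the condensed list."""
    if i == j:
        return 0.0
    if i > j:
        i, j = j, i                     # symmetry: d(i, j) == d(j, i)
    index = i * n - i * (i + 1) // 2 + (j - i - 1)
    return condensed[index]

# Illustrative coordinates only.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 1.0)]
n = len(coords)
condensed = condensed_distances(coords)  # 6 values instead of 16
```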

Thanks for your image and useful copyedits. I added the relevant ref and clarified your examples to "computationally expensive" and "memory-intensive" respectively; I hadn't thought about that usage being confusing. Opabinia regalis 15:54, 30 September 2006 (UTC)

Pls remove TM-Align
In the Methods section, only DALI, SSAP and CE should be included (but not TM-Align). DALI, SSAP and CE are considered "classical methods" in this area, and the papers on them have received a large number of citations (100+ for each). We should not describe new and not-so-popular-yet methods like TM-Align, because several other structural alignment methods also exist. (Please see http://en.wikipedia.org/wiki/Structural_alignment_software). If we include TM-Align, we should also include all of these methods. However, I would recommend adding some popular methods such as VAST (recognized by PDB), FATCAT (recognized by PDB), and SSM (recognized by SCOP).
 * Thanks, for some reason this article fills up with good-natured but promotional text if unwatched. I've shortened the TM-align section (it was rather wordy and diffuse anyway) out of desire not to give new and 'exciting' methods undue weight; however, as a Zhang/Skolnick project I think it's notable enough to merit a blurb. The new arrangement also provides an easy way to add other new developments without bloating the article.
 * On the other methods, feel free to add sections on them if you're interested. VAST in particular, as it's quite old and maybe not used so much anymore, but a notable piece of the development of these techniques. Opabinia regalis 01:46, 20 December 2006 (UTC)

Why not explain the algorithm?
I would like to add a section explaining how the optimal alignment for a pair of molecules is computed using SVD.--Plediii (talk) 21:03, 1 February 2009 (UTC)


 * Apparently what I was looking for is subtly linked at the end of the introduction. I'm trying to wrap my head around why these are separate articles. --Plediii (talk) 15:06, 2 February 2009 (UTC)