Talk:Chemical similarity

Untitled
For "one of the most important concepts", this entry is woefully incomplete. I think only someone who knows what chemical similarity means could understand this page, and even then it's confusing.

Where to start. How about the reference to Johnson and Maggiora (1990) "which states: similar compounds have similar properties". Except that was also stated in Johnson's article in the Journal of Mathematical Chemistry April 1989, Volume 3, Issue 2, pp 117-145, where the abstract says "As an intuitive concept, molecular similarity has played a fundamental role in chemistry. It is implicit in Hammond's postulate, in the principle of minimum structure change, and in the assumption that similar structures tend to have similar properties,"

A Google Scholar search finds an earlier reference in Raymond E. Carhart, Dennis H. Smith, R. Venkataraghavan, "Atom pairs as molecular features in structure-activity studies: definition and applications", J. Chem. Inf. Comput. Sci., 1985, 25 (2), pp 64–73 DOI: 10.1021/ci00046a002: "It is perhaps not surprising to find this degree of clustering of psychotropic activity, even among non-benzodiazepines, around diazepam; that is consistent with the expectation that similar structures will frequently show similar properties"

(I.e., just because Willett et al. in J. Chem. Inf. Comput. Sci. (1998) 38, pp 983–996 and others say that this is the principle of Johnson and Maggiora, it doesn't mean that they were right, only that that's what most people use as the citation for it.)

There's no mention of using the maximum common substructure (MCS) as a measure of chemical similarity, which goes back to at least the 1967 publication by Armitage, Crowe, Evans, Lynch, and McGuirk, "Documentation of Chemical Reactions by Computer Analysis of Structural Changes", J. Chem. Doc., 1967, 7 (4), pp 209–215 DOI: 10.1021/c160027a006, which says "Two of us (13) have recently described a general algorithm which extends ... automatic detection of similarities among chemical structures. Similarity has been defined for this purpose as the largest connected set of atoms and bonds common to a pair of compounds - i.e., the maximum overlap of the graphs." (Not a complete definition, since it doesn't give a measure, but many people have since given more precise definitions.)

Problem is, that's hard to compute, and not as useful as you might think, which are two reasons why people have migrated to fingerprints.
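To illustrate what "giving a measure" could look like: one Tanimoto-style normalization that I believe is in common use (it is not taken from the 1967 paper, and the symbols here are my own notation) divides the size of the common substructure by the size of the union, where |X| counts atoms (or bonds) in X:

```latex
S_{\mathrm{MCS}}(A, B) = \frac{|\mathrm{MCS}(A, B)|}{|A| + |B| - |\mathrm{MCS}(A, B)|}
```

This gives 1 when the two structures are identical and 0 when they share nothing. It is only one of several possible choices.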

There needs to be a description of what "fingerprint" means. As far as I can tell, the older (pre-1990s) term was "vector in descriptor space", but "fingerprint" usually means a binary fingerprint or a count fingerprint, and doesn't take on the real values that a vector can have.

There should also be a description of how to generate a fingerprint. If someone doesn't know what a "fragment-based structural key" means, are they really going to go to Durant et al.? Plus, the word "hash" doesn't even exist on the page, and the "Daylight" link, which goes to their home page, is unhelpful if you want to know how the Daylight hash fingerprints work.
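As a rough idea of what such a description could convey, here is a toy sketch of a path-based hashed fingerprint. It is not the Daylight algorithm: the function names, the use of `zlib.crc32`, the 64-bit length, and the one-bit-per-path rule are all assumptions made for illustration, and bond orders are ignored. It only shows the general idea of enumerating linear atom paths and hashing each one to a bit position.

```python
import zlib

def enumerate_paths(atoms, bonds, max_len=3):
    """Return the set of linear atom paths with up to max_len atoms, as strings."""
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    paths = set()

    def walk(path):
        label = "-".join(atoms[i] for i in path)
        rev = "-".join(atoms[i] for i in reversed(path))
        # A path and its reverse are the same fragment; keep one canonical form.
        paths.add(min(label, rev))
        if len(path) < max_len:
            for nxt in neighbors[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return paths

def hashed_fingerprint(atoms, bonds, n_bits=64, max_len=3):
    """Hash each path to one bit position; return the set of 'on' bit indices."""
    return {zlib.crc32(p.encode()) % n_bits
            for p in enumerate_paths(atoms, bonds, max_len)}

# Heavy atoms only: ethanol (C-C-O) and propanol (C-C-C-O).
ethanol = hashed_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
propanol = hashed_fingerprint(["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
```

With paths of at most three atoms, every path in ethanol also occurs in propanol, so the ethanol bits are a subset of the propanol bits. Because different paths can hash to the same bit, a set bit does not prove that a particular fragment is present, which is the main difference from a structural key.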

Also, the statement "are sufficiently good for handling small and medium-sized chemical databases, whereas processing of large databases is performed with fingerprints having much higher information density" does not give a citation, and I am not aware of that conclusion in the literature. Remember too that this is in the context of the reoptimized MDL keys (that's what the citation says!), not the original 166-bit or 960-bit keys. (Then again, there is no primary citation to the original keys; everyone just says "MDL did it".)

Also, though not so relevant for the main article, the use of Tanimoto in chemical structure similarity goes back to at least George W. Adamson and Judith A. Bush, "A Comparison of the Performance of Some Similarity and Dissimilarity Measures in the Automatic Classification of Chemical Structures" J. Chem. Inf. Comput. Sci., 1975, 15 (1), pp 55–58 DOI: 10.1021/ci60001a016 which is 15 years before the current oldest citation.
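For readers of this talk page who have not seen it, the Tanimoto coefficient on binary fingerprints is the number of bits set in both fingerprints divided by the number set in either. A minimal sketch, assuming each fingerprint is given as a collection of "on" bit indices (the function name and the empty-fingerprint convention are my own choices, not taken from any of the cited papers):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two binary fingerprints given as 'on' bit indices."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        # Assumed convention: two empty fingerprints count as identical.
        # Other conventions (for example returning 0.0) are also in use.
        return 1.0
    return len(a & b) / len(a | b)

score = tanimoto({1, 2, 3}, {2, 3, 4})  # 2 shared bits out of 4 distinct bits
```

Here `score` is 0.5, since bits 2 and 3 are shared and bits 1, 2, 3, 4 are set in at least one fingerprint.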

However, I suspect that most of these comments will be considered "original research", since, well, it is. So I leave it here in the talk page, and if I write it up some day then someone else can cite me. -- Andrew Dalke

External links modified
Hello fellow Wikipedians,

I have just modified one external link on Chemical similarity. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:
 * Added archive https://web.archive.org/web/20081011141517/http://www.bci.gb.com/ to http://www.bci.gb.com/

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

Cheers. InternetArchiveBot 02:46, 4 August 2017 (UTC)

If "distance measures" are just "metrics", then some explanations are wrong.
I'm not a chemist, but a mathematician. In mathematics, a "metric" or "distance function" is in general required to satisfy the triangle inequality. The Euclidean metrics form a special case; like all metrics, they satisfy the triangle inequality, but they also satisfy some much more specific conditions. (Mostly, they are defined for linear or affine spaces over R, and associated with Euclidean norms; the latter may be characterised by a relation between the norms of u, v, u+v, and u-v for arbitrary vectors u and v.)
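A small numerical sketch may make the distinction concrete. The function names and sample points below are my own and purely illustrative. The Manhattan distance satisfies the triangle inequality on these points but fails the parallelogram law (the relation between the norms of u, v, u+v and u-v), so it is a metric without being Euclidean. The squared Euclidean distance fails the triangle inequality and so is not a metric at all. A finite check like this can only find counterexamples; it does not prove that an inequality holds in general.

```python
import itertools
import math

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def satisfies_triangle_inequality(dist, points, tol=1e-9):
    """Check d(x, z) <= d(x, y) + d(y, z) for every triple of sample points."""
    return all(dist(x, z) <= dist(x, y) + dist(y, z) + tol
               for x, y, z in itertools.product(points, repeat=3))

def satisfies_parallelogram_law(dist, u, v, tol=1e-9):
    """Check |u+v|^2 + |u-v|^2 == 2|u|^2 + 2|v|^2, with |x| = dist(x, origin)."""
    origin = tuple(0.0 for _ in u)
    norm = lambda x: dist(x, origin)
    plus = tuple(a + b for a, b in zip(u, v))
    minus = tuple(a - b for a, b in zip(u, v))
    lhs = norm(plus) ** 2 + norm(minus) ** 2
    rhs = 2 * norm(u) ** 2 + 2 * norm(v) ** 2
    return abs(lhs - rhs) <= tol

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
u, v = (1.0, 0.0), (0.0, 1.0)

results = {
    "manhattan": (satisfies_triangle_inequality(manhattan, points),
                  satisfies_parallelogram_law(manhattan, u, v)),
    "euclidean": (satisfies_triangle_inequality(euclidean, points),
                  satisfies_parallelogram_law(euclidean, u, v)),
    "squared_euclidean": (satisfies_triangle_inequality(squared_euclidean, points),
                          None),  # not a metric, so the norm test is not applied
}
```

In `results`, the first entry of each pair is the triangle-inequality check and the second is the parallelogram-law check.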

However, this article contained the statement
 * ''Distance measures can be classified into Euclidean measures and non-Euclidean measures depending on whether the triangle inequality holds.''

If this were an article on mathematics, I'd be rather sure that this is just an error. I still think that this is the case; the internal wp-links to the corresponding mathematical concepts follow the outline above, and the chemical source given in the next sentence seems to name only a few of its molecular kernel distance functions as Euclidean, and these indeed seem to be based on the ordinary Euclidean metric for 2D or 3D representations of the molecules in question. However, I may also have misunderstood that article, and the quoted statement has been in our article since 2008, without any complaints from chemically more knowledgeable editors.

I'm going to remove that sentence, based on my belief that the references to the mathematical concepts mean that there really is a nine-year-old misunderstanding here. However, if I'm the one who misunderstood some chemical concepts, then please revert my edit, but also add adequate wp-links and/or external sources explaining this rather "unmathematical" use of the term "Euclidean measure". JoergenB (talk) 12:19, 20 August 2017 (UTC)