User:Itub/Ambiguous chemical identifiers

What should the CAS number and other identifiers in a chemical infobox mean? Ideally, they should describe the chemical described by the title of the article. This is not always as trivial as it sounds, due to ambiguous article titles, articles dealing with multiple substances, and variations in protonation and tautomerization. In many cases there are CAS numbers for "ambiguous substances" (e.g., mixtures or substances with unspecified stereochemistry) and these should be used when appropriate. With identifiers such as SMILES and InChI, it is generally possible to omit the stereochemical information, but this does not work for other types of isomers.
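As a rough illustration of the last point, the sketch below removes stereo markers from a SMILES string at the text level. The function name `strip_stereo` is hypothetical, and the two stereo-annotated tartaric acid strings were written by hand for illustration; no claim is made about which named isomer each one corresponds to. A real workflow would probably use a cheminformatics toolkit rather than a regular expression.

```python
import re

def strip_stereo(smiles: str) -> str:
    """Naively drop stereo markers from a SMILES string (text-level sketch).

    '@' and '@@' mark tetrahedral centres; '/' and '\\' mark double-bond
    geometry. Bracket atoms such as [CH] are left as they are.
    """
    return re.sub(r"[@/\\]", "", smiles)

# Two mirror-image stereo annotations of the tartaric acid skeleton
# (hand-written for illustration only).
form_a = "O=C(O)[C@H](O)[C@@H](O)C(=O)O"
form_b = "O=C(O)[C@@H](O)[C@H](O)C(=O)O"

# Both collapse to the same "generic" string once stereo is removed.
print(strip_stereo(form_a) == strip_stereo(form_b))

# Structural isomers are untouched, so they stay distinct:
# o-xylene vs p-xylene.
print(strip_stereo("Cc1ccccc1C") == strip_stereo("Cc1ccc(C)cc1"))
```

The first comparison succeeds because the two strings differ only in their stereo markers; the second fails because the xylene strings differ in connectivity, which is why no generic string exists for structural isomers.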

This page tries to classify these ambiguous cases and give specific examples of existing articles of each type, so that we can use them as case studies.

Ambiguous article title

 * Set of compounds given equal importance. These can usually have a "generic CAS" in the infobox, and may have a supplementary table elsewhere in the article with data for specific compounds.


 * Structural isomers: xylene. No usable generic InChI or SMILES.


 * Stereoisomers: tartaric acid. There can be a generic InChI or SMILES if the stereochemical information is removed.


 * Ambiguous stereochemistry in title, but article really focuses on a specific isomer: alanine focuses on L-alanine, the standard enantiomer. There may be some passing mention of other isomers. This is the typical situation for many chiral natural products and drugs. The infobox should focus on the main isomer, and data for the minor one can be given elsewhere if necessary.

Lactic acid has four CAS numbers: "generic" (?), L, D, and racemic (?). It uses CASOther and hard HTML line breaks to format the table cell so that it looks nice.

Glucose has three main isomers (open-chain, plus the alpha and beta pyranose forms), not counting the unnatural L-glucoses and the furanose forms. All three could be given in the infobox, but perhaps just one structure diagram is enough. The current infobox has one CAS number for D and one for L. Is one of them perhaps a generic number that doesn't specify an isomer? (Will have to check.)

Salts

 * Hydrochlorides of neutral bases (e.g., cocaine). Probably better to focus on the free base in the infobox unless the article is really specific to the salt.


 * Ionic bases such as ATP. Good question. They are often depicted deprotonated, usually as the ion that is believed to predominate at physiological pH. However, CAS has it protonated. I don't know if CAS has a separate entry for deprotonated ATP (perhaps it tends to normalize these types of natural products to the neutral/protonated state), but it does have separate entries for acids and conjugate bases of common acids. For example, phosphoric acid (7664-38-2) vs phosphate (14265-44-2).

In some cases there may be more than one salt in common use. If there are "too many", we could add an extra table.
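When several CAS numbers are listed side by side (acid, conjugate base, various salts), a transcription slip is easy to make. A minimal sketch of a sanity check is given below, using the standard CAS check-digit rule (the last digit equals the weighted sum of the preceding digits, counted from the right, modulo 10). The function name `cas_check_digit_ok` is hypothetical; the two numbers are the phosphoric acid and phosphate entries quoted above. Passing the check only shows that a number is well formed, not that it belongs to the intended substance.

```python
import re

def cas_check_digit_ok(cas: str) -> bool:
    """Return True if a CAS number has a consistent check digit."""
    m = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas)
    if not m:
        return False
    digits = m.group(1) + m.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(m.group(3))

print(cas_check_digit_ok("7664-38-2"))   # phosphoric acid
print(cas_check_digit_ok("14265-44-2"))  # phosphate
```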

Tautomers
As in the ATP case above, I don't know how often CAS just chooses one form and how often it has a separate entry for each tautomer. I suspect it has each tautomer for the simple cases (acetaldehyde/vinyl alcohol) and maybe for important biomolecules such as the standard DNA bases. More research is needed. ;-)

General suggestions
It seems impossible to have a unique compound/identifier per article in all cases, because often the variations on a compound are so minor that they don't deserve separate articles. Therefore we need to give annotated identifiers more often. We can give a "generic identifier" as the primary identifier when available and appropriate, but then we can also give the annotated individual identifiers. When there is no generic identifier available, any identifier that is given needs to be annotated. When there are only a few variations (let's say up to four or so), they could safely be added to the article infobox. But if there are too many, a separate table may be necessary. The article on tartaric acid does this at the moment by having a table called "forms of tartaric acid" in the body of the article, which lists five forms (generic, racemic, plus the three isomers).
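The rule of thumb above could be sketched roughly as follows. Everything here is an assumption for illustration: the `AnnotatedId` class, the `placement` function, and the cutoff of four (the "up to four or so" mentioned above, counting every listed form including the generic one). The CAS fields are deliberately left as placeholders rather than filled in.

```python
from dataclasses import dataclass

@dataclass
class AnnotatedId:
    label: str         # annotation, e.g. "generic", "L", "racemic"
    cas: str = "(?)"   # placeholder; real numbers are not filled in here

def placement(ids: list[AnnotatedId], infobox_limit: int = 4) -> str:
    """Suggest where a set of annotated identifiers should go."""
    return "infobox" if len(ids) <= infobox_limit else "separate table"

# Four forms, as for lactic acid.
lactic = [AnnotatedId(lbl) for lbl in ("generic", "L", "D", "racemic")]

# Five forms, as in the "forms of tartaric acid" table.
tartaric = [AnnotatedId(lbl) for lbl in
            ("generic", "racemic", "isomer 1", "isomer 2", "isomer 3")]

print(placement(lactic))    # infobox
print(placement(tartaric))  # separate table
```

Keeping the label next to each identifier is the point of the sketch: an identifier without its annotation is exactly the ambiguous case this page is trying to avoid.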