Wikipedia:Chemical infobox/Workshop

The discussion is at the moment focused on the chembox, but it will also be applicable to the drugbox and maybe also to protein.

Points:
 * Upgrades necessary for making the data more 'machine searchable'?

General
A good way of storing the data would be in an xml-type structure, which would be possible with templates:

 {{chembox
 | Field1 = {{chembox IUPACName
   | name = water
   }}
 | Section1 = {{Chembox Properties
   | Field1 = {{chembox BoilingPoint
     | value = 100
     | unit = C
     }}
   }}
 }}

The main problem is that this is going to be a 'transclusion hell', making it highly error-prone and difficult to edit for people who are not in the field. In that regard the current structure is better.

 {{chembox
 | IUPACname = water
 | Section1 = {{Chembox Properties
   | BoilingPt = 100&deg;C
   }}
 }}

This is easy to edit/correct for even very new editors but still consistent. It does at the moment have some problems, though.

Below is some discussion on how certain fields would need to be reformatted to give consistent data, with a focus on 'machine searchability' (including the use of search engines and automated fact-checking).

Identifiers
Identifiers are troublesome: sometimes (often?) they do not point to the proper compounds, or compounds have more than one identifier of the same type. Also, certain identifiers are specific to certain 'types' of compounds/materials.

The only way to do this properly is to link to the information provided by the authority itself. Thus EINECS points to the appropriate site. If the authority does not provide information for free then I personally would not point to the site but others may wish to. Petermr (talk) 19:21, 16 January 2008 (UTC)

I agree with Peter. EINECS/EC-number do link appropriately I believe and the link should be one to one..not as a search to provide multiple results. Somebody must control the "quality" of the link when populating the ChemBox (or curate later as I am doing..I say get it right initially of course). This is why CAS numbers should not link. The authority for CAS numbers is closed to linking so don't.--ChemSpiderMan (talk) 19:30, 16 January 2008 (UTC)


 * OK, I will make it so. --Dirk Beetstra T  C 19:42, 16 January 2008 (UTC)

CAS

 * No link.

You could link also to PubChem and to ChemSpider...BOTH are dangerous for the purpose of CAS "searches" unless work is done to validate that the CAS is connected to the right structure. In my opinion the user should choose to use the CAS number for the basis of a search in any of the multiple databases but Wikipedia should not link out to perform a search but make sure that the CAS number listed is ONE of the RIGHT ones (or multiple) for the structure displayed in the Structure box. Then, the challenge is making sure the structure is consistent with the article name. There are many cases where the structure and the name of the article don't match too. The article name should be seen as the primary key for Wikipedia I believe.--ChemSpiderMan (talk) 19:18, 16 January 2008 (UTC)

This is entirely inappropriate - emolecules is not an authority for CAS numbers and offers wrong information (see nitric oxide, where the first two structures are not nitric oxide). The only legitimate thing is to link to CAS: STNEasy, which costs ca. 6 USD per CASNO. Petermr (talk) 19:21, 16 January 2008 (UTC)


 * Good points. The reason we did not choose PubChem was that it was already linked via the PubChem field, which should give a more accurate hit (when the correct number is there, at least). Also, searching on CAS in PubChem gives multiple hits. I could favour removing the link altogether, though I think that a good 'linkfarm' type link should be there in the chembox, as we would like to have a linkfarm hit to suppliers' data (which PubChem does not supply). Suggestions (we could add another field to the Identifiers section)?  --Dirk Beetstra T  C 19:38, 16 January 2008 (UTC)


 * I can of course suggest ChemSpider as a link farm to suppliers but I am biased of course and will immediately get thrashed for doing so. The list of data sources for the many suppliers for ChemSpider is here, for PubChem is here. I cannot find a full list for eMolecules (maybe you can) but a list of their recent updates is here. I think it would be less trouble for me/ChemSpider if you left the linkfarm to eMolecules personally. I've taken enough public thrashings so far :-( --ChemSpiderMan (talk) 23:54, 16 January 2008 (UTC)

EINECS/EC-number

 * Link http://ecb.jrc.it/esis/index.php?GENRE=ECNO&ENTREE=
 * Should be called EC-number, but there are still a lot of EINECS fields out there (see Special:Whatlinkshere/Template:Chembox_EINECS). Remark: this could be an action for chem-awb: rename all EINECS to EC-number and then delete the old fields from the template.

I believe it should be referred to as an EC-Number now?--ChemSpiderMan (talk) 19:19, 16 January 2008 (UTC)
 * Correct, I think. --Dirk Beetstra T  C 19:40, 16 January 2008 (UTC)


 * I could easily do it. I was wondering if it is simpler to just "redirect" einecs to EC-number, such that they both display as "EC-number" in the table regardless of how they are entered. --Rifleman 82 (talk) 15:48, 17 January 2008 (UTC)
 * Editing the template is simpler, indeed. I'll have a look at it.  --Beetstra (public) (Dirk BeetstraT  C on public computers) 12:18, 18 January 2008 (UTC)
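The "redirect" idea above, where both field names end up displayed as "EC-number", can be illustrated outside the template language. This is only a minimal Python sketch of the aliasing logic, not how the chembox template itself is implemented; the alias table, the function name `normalize_fields` and the sample value are assumptions made for the illustration.

```python
# Hypothetical alias table: deprecated field name -> preferred field name.
FIELD_ALIASES = {"EINECS": "EC-number"}


def normalize_fields(fields):
    """Return a copy of `fields` with deprecated names mapped to preferred ones.

    If both the deprecated and the preferred name are present, the value
    entered under the preferred name wins.
    """
    normalized = {}
    for name, value in fields.items():
        preferred = FIELD_ALIASES.get(name, name)
        # An aliased entry never overwrites an explicitly entered preferred field.
        if preferred in normalized and name != preferred:
            continue
        normalized[preferred] = value
    return normalized


if __name__ == "__main__":
    # "123-456-7" is a made-up placeholder, not a real EC-number.
    print(normalize_fields({"EINECS": "123-456-7"}))
```

With this approach existing articles would not need to be edited at once; a bot run could still rename the old fields later.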

PubChem

 * Link: http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=

This is fine provided that the structure displayed in the ChemBox is the one you link to on PubChem. Not a different tautomer or different stereochemistry etc. Once I've handed back the SDF file then the InChIKey can be used to search Pubchem for a direct match ONCE PubChem supports InChIKeys.--ChemSpiderMan (talk) 19:20, 16 January 2008 (UTC)

InChI
Used to display the field InChI as is. Problem is that these fields have to be broken to display correctly.
 * No link


 * Just created the fields FullInChI and DispInChI. It now works as follows: if DispInChI is provided, that is displayed; otherwise the field InChI is displayed. If we want to make this searchable, then in the end it works as follows:
 * Use InChI or FullInChI as the field for the unbroken link, and use DispInChI for the text as it is shown on-screen.
 * This requires that the original InChI field is copied to DispInChI, and that the original is then stripped of, among other things, <br />.


 * Problem: putting InChI into a link does not work, the URL breaks on some of the character(-combinations) used in an InChI. May need some 'higher intervention' ... when using the DispInChI we can circumvent some problems by using url-allowed codes to make the InChI correct.
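To illustrate the URL problem: characters such as "/" and "=" in an InChI have a special meaning in URLs, so they need to be percent-encoded before the string is appended to a link. The following is a minimal Python sketch of that encoding step only; the function name `inchi_to_url` and the search prefix are assumptions (no real search service is implied), and inside a wiki template a different mechanism would be needed.

```python
from urllib.parse import quote, unquote

# Hypothetical link prefix, used only for this illustration.
SEARCH_PREFIX = "http://example.org/search?q="


def inchi_to_url(inchi, prefix=SEARCH_PREFIX):
    """Append a percent-encoded InChI to a link prefix.

    safe="" makes quote() encode "/" as well, which it would
    otherwise leave untouched.
    """
    return prefix + quote(inchi, safe="")


if __name__ == "__main__":
    water = "InChI=1/H2O/h1H2"
    url = inchi_to_url(water)
    print(url)
    # Decoding the query part gives back the original string.
    print(unquote(url[len(SEARCH_PREFIX):]) == water)
```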

I see the purpose of InChIStrings as a way for the user to convert back to a structure...as SMILES are used presently. Indexing InChIStrings is problematic and therefore the InChIKey was encouraged by Google. InChIKeys will be valuable but will need to be generated using the appropriate "standard settings"...a discussion initiated by the InChI Team just last week. Otherwise these keys will be less valuable. Stereochemistry, specific tautomer etc. will be essential for the structure for this to be useful. If I were to use the InChIKey as a link it would be to catalyze a search via a search engine, but how to make a politically correct decision about whether it's Google, Yahoo or Microsoft? WP should be neutral to all.--ChemSpiderMan (talk) 19:34, 16 January 2008 (UTC)
 * The solution is probably to take InChIs out of the chembox altogether and use the brand-sparkling-new InChI instead… Physchim62 (talk) 14:21, 20 January 2008 (UTC)
 * I think it has a right place in the chembox, that's where all the data goes. The only problem is, how to format it properly.  I have today been playing with table-width, but it seems buggy, it does not force a width, only as a minimum.  --Dirk Beetstra T  C 18:31, 21 January 2008 (UTC)

SMILES
Used to display the field SMILES as is. Problem is that these fields have to be broken to display correctly.
 * No link


 * Just created the fields FullSMILES and DispSMILES. It now works as follows: if DispSMILES is provided, that is displayed; otherwise the field SMILES is displayed. If we want to make this searchable, then in the end it works as follows:
 * Use SMILES or FullSMILES as the field for the unbroken link, and use DispSMILES for the text as it is shown on-screen.
 * This requires that the original SMILES field is copied to DispSMILES, and that the original is then stripped of, among other things, <br />.


 * Problem: putting SMILES into a link does not work, the URL breaks on some of the character(-combinations) used in a SMILES. May need some 'higher intervention' ... when using the DispSMILES we can circumvent some problems by using url-allowed codes to make the SMILES correct.
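The copy-and-strip step described above could in principle be done by a bot. The following is a rough Python sketch, assuming the existing field contains HTML break tags for display; the function name `split_display_field` is hypothetical, and the real cleanup may need to handle more markup than just `<br />`.

```python
import re

# Matches <br>, <br/>, <br /> in any letter case.
BREAK_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def split_display_field(raw):
    """Split an existing field value into (full, display).

    display: the value as entered, including the line breaks.
    full:    the same value with break tags and whitespace removed,
             suitable as the unbroken, searchable string.
    """
    full = BREAK_TAG.sub("", raw)
    # A SMILES string contains no whitespace, so any remaining
    # whitespace is treated as a formatting leftover and dropped.
    full = "".join(full.split())
    return full, raw


if __name__ == "__main__":
    full, display = split_display_field("CC(=O)<br />Oc1ccccc1C(=O)O")
    print(full)     # value for FullSMILES
    print(display)  # value for DispSMILES
```

The same step would apply to the InChI field, since both fields are broken for display in the same way.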

There are a number of "incorrect SMILES". What do I mean by this? I take the SMILES and try to convert them using standard tools and they do not convert. Maybe this is the tool itself? Maybe the SMILES DO convert in other tools? There are many SMILES available consistent with a structure. There are canonical SMILES too. How can we recommend where SMILES come from...what tool to generate them? If they do not convert back to structures, what's the point? How to validate? It would be possible to provide a validation tool by passing the SMILES out to some external app to display online and have the user confirm it's correct. For example OpenBabel to convert and display in Jmol. It's an example. What's the purpose of the SMILES? I assume to allow people to grab it and convert to a structure in their desktop software? That's how I use it...but it does fail quite regularly and the line breaks are a problem.--ChemSpiderMan (talk) 19:26, 16 January 2008 (UTC)

RTECS

 * No link

I suspect this may be an authority which does not offer numbers online. Petermr (talk) 19:21, 16 January 2008 (UTC)

RTECS is now controlled by MDL (Symyx) and the links will not go anywhere. I am not sure anyone will worry about the ID myself?--ChemSpiderMan (talk) 19:28, 16 January 2008 (UTC)

KEGG

 * Link http://www.genome.ad.jp/dbget-bin/www_bget?cpd:

MeSHName
Generally a GOOD quality source for validation. NOT perfect but extremely good...manually curated--ChemSpiderMan (talk) 19:21, 16 January 2008 (UTC)
 * Link http://www.nlm.nih.gov/cgi/mesh/2007/MB_cgi?mode=&term=

ATCCode/ATCCode_prefix/ATCCode_suffix/ATC_supplemental

 * Link http://www.whocc.no/atcddd/indexdatabase/index.php?query=1

ChEBI
What's the purpose of the link to ChEBI?--ChemSpiderMan (talk) 19:21, 16 January 2008 (UTC)
 * Link http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:1
 * Purely compatibility with the drugbox I believe (guess I 'stole' it there, I incorporated as many fields at that point as I could find), don't know what 'extra' information it can yield. --Dirk Beetstra T  C 19:32, 16 January 2008 (UTC)

EINECSCASNO
Deprecated; still there for historical cases, which should be found, split and removed.

Units
It has been suggested to remove the units from the fields, and make them only show the number.

Taking Boiling_Pt and Melting_Pt as an example: these two fields now show the BP/MP as text, while the fields Boiling_PtC, Boiling_PtK and Boiling_PtF take the temperature in degrees Centigrade, Kelvin or Fahrenheit respectively (without a unit) and then display it correctly with a unit.
 * I agree with the suggestion. Fully supported.--ChemSpiderMan (talk) 19:22, 16 January 2008 (UTC)


 * What would be optimal is using BoilingPt itself as the reference to e.g. the boiling point in Kelvin (using negative numbers in Centigrade seems to break sometimes; I have to look into that). In that case one could actually search for BoilingPt=100 and find, among others, water. This would need quite some reworking, but it could be done 'botwise'. That goes for many of the datafields, by the way.  --Dirk Beetstra T  C 19:34, 16 January 2008 (UTC)
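As an illustration of what a bot would have to do to fill one reference field from the unit-specific ones, here is a minimal Python sketch of the conversion to Kelvin. The function name `to_kelvin` and the single-letter unit codes are assumptions that mirror the C/K/F suffixes of the Boiling_Pt fields; this is not the code of any existing bot or template.

```python
def to_kelvin(value, unit):
    """Convert a temperature to Kelvin.

    unit is "C" (Centigrade), "K" (Kelvin) or "F" (Fahrenheit),
    mirroring the suffixes of Boiling_PtC, Boiling_PtK and Boiling_PtF.
    Negative values need no special treatment here.
    """
    if unit == "K":
        return float(value)
    if unit == "C":
        return value + 273.15
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0 + 273.15
    raise ValueError("unknown temperature unit: %r" % (unit,))


if __name__ == "__main__":
    # Boiling point of water entered in three different fields.
    for value, unit in [(100, "C"), (373.15, "K"), (212, "F")]:
        print(unit, round(to_kelvin(value, unit), 2))
```

Storing one normalised number per compound is what would make a search on the boiling point itself possible, regardless of the unit in which the value was originally entered.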