User:Charles Matthews/Facto Post/Issue 11 – 9 April 2018

Facto Post – Issue 11 – 10 April 2018
{| style="position: relative; margin-left: 2em; margin-right: 2em; padding: 0.5em 1em; background-color: #7FFFD4; border: 2px solid #00FFFF; border-color: rgba( 109, 193, 240, 0.75 ); border-radius: 8px; box-shadow: 8px 8px 12px rgba( 0, 0, 0, 0.7 );"
| Facto Post – Issue 11 – 10 April 2018

 

The 100 Skins of the Onion
Open Citations Month, with its eminently guessable hashtag, is upon us. We should be utterly grateful that in the past 12 months, so much data on which papers cite which other papers has been made open, and that Wikidata is playing its part in hosting it as "cites" statements. At the time of writing, there are 15.3M Wikidata items that can do that.
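For readers who want to look at those "cites" statements directly, here is a minimal sketch of a SPARQL query, built in Python, that lists items citing a given work. It assumes the Wikidata property P2860 for "cites" and the public Wikidata Query Service endpoint; the function names and the item ID in the usage note are illustrative only, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def citing_works_query(qid: str, limit: int = 10) -> str:
    """Build a SPARQL query listing items that cite the work `qid`.

    P2860 is assumed here to be the property behind "cites" statements.
    """
    # Accept only IDs of the form Q<digits>, so nothing odd ends up in the query.
    if not (qid.startswith("Q") and qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {qid!r}")
    return (
        "SELECT ?citing ?citingLabel WHERE {\n"
        f"  ?citing wdt:P2860 wd:{qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def query_url(sparql: str) -> str:
    """Return a GET URL asking the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})
```

Calling `citing_works_query("Q12345", limit=5)` (an arbitrary placeholder ID) gives query text that can be pasted into the query service, or passed to `query_url` and fetched with any HTTP client.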

Pulling back to look at open access papers in the large, though, there is less reason for celebration. Access in theory does not yet equate to practical access. A recent LSE IMPACT blogpost puts that issue down to "heterogeneity": a useful euphemism to save us from thinking that the whole concept falls into the realm of the oxymoron.

Some home truths: aggregation is not content management, if it falls short on reusability. The PDF file format is wedded to how humans read documents, not how machines ingest them. The salami-slicer is our friend in the current downloading of open access papers, but for a better metaphor, think about skinning an onion, laboriously, 100 times with diminishing returns. There are of the order of 100 major publisher sites hosting open access papers, and the predominant offer there is still a PDF.

From the discoverability angle, Wikidata's bibliographic resources combined with SPARQL queries are superior in principle, by far, to existing keyword searches run over papers. Open access content should be managed into consistent HTML, something that is currently strenuous. The good news, such as it is, would be that much of it is already in XML.

The organisational problem of removing further skins from the onion, with sensible prioritisation, is certainly not insuperable. The CORE group (the bloggers in the LSE posting) has some answers, but not all that is needed for the text and data mining purposes they highlight. The long tail, or in other words the onion heart when it has become fiddly beyond patience to skin, does call for a pis aller. But the real knack is to do more between the XML and the heart.
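As a rough illustration of the step from publisher XML to consistent HTML, here is a minimal sketch in Python. It assumes a JATS-like layout (`article-title`, `body`, `sec`, `title`, `p`), which is an assumption about the input and not something every publisher site will honour; the sample document and function names are made up for the purpose.

```python
import xml.etree.ElementTree as ET
from html import escape

# Made-up sample in a JATS-like shape; real publisher XML varies considerably.
SAMPLE_XML = """<article>
  <front><article-meta><title-group>
    <article-title>A sample paper</article-title>
  </title-group></article-meta></front>
  <body>
    <sec><title>Introduction</title><p>First paragraph &amp; more.</p></sec>
    <sec><title>Methods</title><p>Second paragraph.</p></sec>
  </body>
</article>"""


def text_of(elem):
    """Flatten an element's text, dropping inline markup and collapsing whitespace."""
    return " ".join("".join(elem.itertext()).split())


def article_to_html(xml_text):
    """Map the title, section headings and paragraphs to h1, h2 and p elements."""
    root = ET.fromstring(xml_text)
    parts = []
    title = root.find(".//article-title")
    if title is not None:
        parts.append(f"<h1>{escape(text_of(title))}</h1>")
    # Only top-level sections of the body are handled in this sketch.
    for sec in root.iterfind("./body/sec"):
        heading = sec.find("title")
        if heading is not None:
            parts.append(f"<h2>{escape(text_of(heading))}</h2>")
        for para in sec.iterfind("p"):
            parts.append(f"<p>{escape(text_of(para))}</p>")
    return "\n".join(parts)


print(article_to_html(SAMPLE_XML))
```

The point of the sketch is only that one small mapping yields the same HTML shape whatever the source. The strenuous part is that each of the 100 or so skins tends to need its own variant of it.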

Links
 * Crossref as a new source of citation data: A comparison with Web of Science and Scopus, CWTS blogpost 17 January 2018, Nees Jan van Eck, Ludo Waltman, Vincent Larivière, Cassidy Sugimoto
 * Citations with identifiers in Wikipedia, figshare dataset
 * Making women more visible online—with Wikidata tools!, Wikimedia blogpost 29 March 2018 by Sandra Fauconnier
 * Village pump discussion "Turn on mapframe? We’re ready if you are" reaches conclusions
 * The Power of the Wikimedia Movement beyond Wikimedia, Forbes 28 March 2018, Michael Bernick
 * Tracing stolen bitcoin, blogpost 26 March 2018 by Ross J. Anderson

To subscribe to Facto Post go to Facto Post mailing list. For the ways to unsubscribe, see below.
Editor Charles Matthews, for ContentMine. Please leave feedback for him. Back numbers are here.
Reminder: WikiFactMine pages on Wikidata are at WD:WFM.

If you wish to receive no further issues of Facto Post, please remove your name from our mailing list. Alternatively, to opt out of all massmessage mailings, you may add Category:Wikipedians who opt out of message delivery to your user talk page.
Newsletter delivered by MediaWiki message delivery
|}