Wikipedia:WikiProject Open/Open access task force/Signalling OA-ness/Report

Summary
This project aimed to prototype a system for signalling the openness of a reference cited on Wikipedia. Such a prototype with basic functionality now exists. It consists of a manually triggered but otherwise automated import of suitably licensed scholarly articles into Wikisource and of the associated media into Wikimedia Commons. In turn, this allows users citing such a scholarly article to indicate what the license is and that copies of the full text and media exist on Wikimedia platforms. As a complement to signalling the openness of cited articles, a gadget has been built that signals, by way of the Open Access Button, how often access to a reference was denied by a paywall. The software produced in the framework of the project is openly licensed and available on GitHub. It depends heavily on third-party progress, especially the development of Wikidata and the integration of other Wikimedia projects with it, which has to be taken into account for future activities.

Integration with Wikisource
A basic pipeline for automated import into the English Wikisource exists, though approval for importing directly into the main namespace remains a community issue. The main reason is inconsistencies (more on that below) in the XML that publishers deliver to PubMed Central (PMC), which may result in inconsistent formatting on Wikisource, or in images of equations or tables filling up the file namespace there. Another reason is mixed community support, in part because of such technical problems, but also because contemporary scientific literature is a rare sight on Wikisource, and so are import bots.

The current approach is to import automatically into subpages of WikiProject Open Access, which can then be used to fine-tune the bot code (mostly to work around the inconsistencies in the XML source). Once the output reaches an acceptable level, the remaining adjustments are made by hand and the article is manually moved over to the main namespace.

At PMC, a range of complex tools have been developed to handle the incoming XML in a very similar fashion (to allow for consistent HTML rendering), and some of these tools have begun to be shared with our project, which is expected to facilitate the handling of edge cases in the future.

One feature we had not initially thought of, but which was requested by the Wikisource community, was the creation of author pages. These are pages that provide some basic information about authors of works hosted on Wikisource. For most of the works already on the site, much more information is publicly available than for most authors of articles imported from PMC, and disambiguation of authors is a challenge. However, integration of Wikisource (all languages) with Wikidata started in early 2014, and in the long run, such author pages could be created on the fly based on information in Wikidata about authors and publications, especially if it includes unique persistent identifiers like ORCID or DOI.

At around the time of Wikimania London 2014, our bot was blocked from editing Wikisource. Luckily, our presence at the event allowed us to talk with Wikisource administrators about how to proceed. This in-person contact was invaluable because we secured the support of users who had previously been averse to our project in Community Portal discussions (the "scriptorium"). Eventually we were unblocked and given clear directives on what to fix in order to run fully endorsed.

Wikipedia Zero now operates via IP range. This provides another benefit to our project because it means that the Open Access articles that we are mirroring on Wikisource can now also be read free of charge under the Wikipedia Zero umbrella.

While the text of imported articles goes to Wikisource, the media files go to Wikimedia Commons, which is the common media repository of all Wikimedia projects. From there, they can be embedded into any MediaWiki page (or into blog posts, for that matter), including the respective Wikisource page. Equations and tables are a special case: many publishers supply them as images rather than in machine-readable form. Most of these would not be suitable for Commons, so the current approach is to upload them to Wikisource, where they can be transcribed.

Integration with Wikimedia Commons
The part of Recitation bot concerned with the upload to Wikimedia Commons is based on code from the Open Access Media Importer, which handles supplementary audio and video files and has been running since 2012. Recitation bot has been adapted to handle figures from both the article body and the supplementary materials. Supplementary files that are neither audio nor video nor images are simply linked to as external sources.
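Taken together with the handling of equation and table images described earlier, the routing of files can be sketched as follows. This is a minimal illustration, not Recitation bot's actual code: the function name `route_file`, the destination labels and the idea of guessing the media type from the file name are all assumptions made for the example.

```python
import mimetypes


def route_file(filename, is_equation_or_table=False):
    """Return a hypothetical destination label for a file attached to an article."""
    # Images of equations or tables are kept on Wikisource for transcription.
    if is_equation_or_table:
        return "wikisource"
    # Guess the media type from the file extension (assumption: extensions are reliable).
    mime, _ = mimetypes.guess_type(filename)
    major = mime.split("/")[0] if mime else None
    # Audio, video and image files are uploaded to Wikimedia Commons.
    if major in ("audio", "video", "image"):
        return "commons"
    # Everything else is only linked to as an external source.
    return "external-link"
```

For instance, `route_file("figure1.png")` would be routed to Commons, while a PDF supplement would only be linked.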

While some initial aspects of Wikidata integration are coming to Commons in December 2014, key features for our purposes are arbitrary access (expected for mid-2015) and integration of file metadata (probably 2016).

An additional complication arises from PubMed Central currently limiting the resolution of the images to which it provides programmatic access, which may reduce the usefulness of automated uploads from PubMed Central in comparison to full-size uploads from the article publisher's website.

Integration with Wikidata
For definitions of the Wikidata-specific terms used in this section, see Glossary.

The aim here was to pave the way for the metadata of scholarly references to become available through Wikidata in a way that would allow the metadata to be pulled into citation templates on pages across Wikimedia wikis.

On the basis of a sample article (Q15625490), an initial data model for this metadata is being drafted on Wikidata. It includes generic properties like article title (P357), authors (P50) and date of publication (P577), but also identifiers like DOI (P356), PubMed ID (P698) and PMCID (P932), along with information on licensing (P275), Commons category (P373) or language (P407).
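As a rough illustration, the properties listed above can be held in a simple lookup table and used to check which parts of the draft model are still missing for a given article. The property IDs are the ones named in this report; the dictionary name, the informal labels, the helper `missing_properties` and the shape of the `claims` argument are assumptions made for this sketch.

```python
# Property IDs as listed in the report; the labels are informal glosses.
ARTICLE_PROPERTIES = {
    "P357": "title",
    "P50": "author",
    "P577": "date of publication",
    "P356": "DOI",
    "P698": "PubMed ID",
    "P932": "PMCID",
    "P275": "license",
    "P373": "Commons category",
    "P407": "language",
}


def missing_properties(claims):
    """Return the model's property IDs that have no value in `claims`.

    `claims` is a hypothetical mapping from property ID to a value or list
    of values. The result is sorted by the numeric part of the ID.
    """
    missing = [p for p in ARTICLE_PROPERTIES if not claims.get(p)]
    return sorted(missing, key=lambda p: int(p[1:]))
```

A check of this kind could flag, for example, an article item that has a DOI but no license statement, which matters here because the license is what the signalling depends on.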

When designing the data model, a number of editorial decisions have to be made. For instance, the email address (P968) of an author as stated in a paper could be seen as a property that should go onto the Wikidata item for an author and/or as a qualifier for the author property on the item for the paper. Likewise for affiliation (P1416), which comes with the additional complication that the affiliation given on a paper is often more specific than the scope of existing Wikidata items. For instance, the affiliation of the first author of the sample paper is stated as "Fish Division, Department of Vertebrate Zoology, National Museum of Natural History, Smithsonian Institution, Suitland, Maryland, United States of America" in the paper, which was mapped to Smithsonian Institution (Q131626) in the "affiliation" qualifier for the author statement on the item for the paper. Once Wikidata items exist for authors, they can (and should) be enriched to the extent possible, e.g. with identifiers like ORCID (P496), or with statements about affiliation, perhaps with qualifiers relating to certain points in time. These items can then serve as the basis for creating Author pages on Wikisource (example).

For statements about properties that relate an item to another item (as opposed to just requiring a value), the pages for those related items have to be created before the target statement can be made on the item about the paper. For example, stating that the article (Q15625490) was published in the journal PLOS ONE depends on an item for PLOS ONE having been created (Q564954), which itself has to be populated with suitable statements in order to be useful in the wider framework of Wikidata. The same goes for authorship: authors can only be listed as authors on the item page for a paper if the item page for them has already been created (example: Q17496406). This dependency between items adds technical complexity to the upload of article metadata that goes well beyond a simple conversion from formats like BibTeX. A tool for uploading bibliographic metadata to Wikidata on the basis of these specifications is currently being developed, and a mailing list has been set up for discussions around the topic.
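The ordering constraint described above can be sketched with an in-memory stand-in for Wikidata. Nothing here talks to the real Wikidata API: the class `ItemStore`, the function `upload_paper`, the `local-N` identifiers and the plain-text "published in" label are all hypothetical, and only P50 (author) is taken from this report.

```python
import itertools


class ItemStore:
    """In-memory stand-in for Wikidata; IDs are local placeholders, not Q-numbers."""

    def __init__(self):
        self._ids = {}
        self._counter = itertools.count(1)
        self.log = []  # records the order in which operations happen

    def get_or_create(self, label):
        # Reuse an existing item if one exists, otherwise create it.
        if label not in self._ids:
            self._ids[label] = f"local-{next(self._counter)}"
            self.log.append(("create", label))
        return self._ids[label]

    def add_statement(self, subject_id, prop, target_id):
        self.log.append(("statement", subject_id, prop, target_id))


def upload_paper(store, title, journal, authors):
    # Related items come first: their IDs are needed as statement targets.
    journal_id = store.get_or_create(journal)
    author_ids = [store.get_or_create(name) for name in authors]
    paper_id = store.get_or_create(title)
    store.add_statement(paper_id, "published in", journal_id)
    for author_id in author_ids:
        store.add_statement(paper_id, "P50", author_id)
    return paper_id
```

Looking items up by label alone, as this sketch does, would not be enough in practice, since author names in particular are ambiguous; the sketch only shows why item creation has to precede the statements.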

Integration with Wikipedia
We decided that integration with Wikipedia should happen through Wikidata. As mentioned above, the core feature needed to make citing uploaded journal articles work is "arbitrary access", which is still incomplete (see tracking bug). This means that, currently, citing a paper and signalling its OA-ness in Wikipedia has to be done manually. Still, this gives the benefit of "deep linking": the practice of linking to a particular point in a page, not just to the page. It is now possible to cite a specific sentence in an OA paper by creating an anchor in the paper and then linking to that anchor from the Wikipedia article. It has long been a requested feature on the web that "readers should be able to go directly (in a single click and in real time) to the specific part of the full text of source that is being cited". Converting existing citations into deep links is a difficult problem, but conversations with the developers of the new Peer Library have indicated that it may be possible to automate this. It is useful to pause and remember that this is the sort of feature that is only available with Open Access material, since a deep link cannot be completed through a paywall (unless the reader is logged in or browsing from a subscribing IP address range).

Integration with Citation bot was not completed in time. In fact, Citation bot had failed before and during our work in this project, partly due to unmaintained software, partly because the environment in which it operates changed frequently (e.g. the various citation templates it serves, and their parameterization). This came to our attention, but we did not have the resources to fix it in time. In addition, the template Cite doi that Citation bot uses for triggering became deprecated as well. We received Twitter messages asking us to get Citation bot's functionality running again, so future projects may want to resurrect this functionality, since doing so would garner favour in the community.

Open Access Button
We wrote a "Gadget" for Wikimedia sites to integrate with the Open Access Button. The Open Access Button is displayed next to a citation; it indicates whether a paywall may be encountered and provides a link to report a paywall. We are waiting for input from the gadget approval process on English Wikipedia, so that the gadget can be activated as a standard user preference. There we explained: "This gadget runs on any article page and searches for any DOIs on the page. It then queries the open access button API for the status on these DOIs. The gadget displays the 'blocked' and 'reported' counts from the API. If these counts are zero it displays a button that allows reporting of a paywalled paper. If the counts are higher than zero it displays a closed access logo."
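The decision logic in that description can be sketched as follows. This is not the gadget's source code (the gadget runs in the browser, and this sketch is in Python for illustration only). The DOI pattern is a deliberate simplification, the function names and return labels are hypothetical, and the call to the Open Access Button API is left out because its interface is not described here. The sketch also assumes that any non-zero count leads to the closed access logo, which the description leaves open for the case where only one of the two counts is zero.

```python
import re

# Simplified DOI pattern (an assumption); real DOIs can contain characters it misses.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>|}\]]+')


def find_dois(page_text):
    """Return the unique DOIs found in the text, in order of first appearance."""
    seen = []
    for match in DOI_PATTERN.findall(page_text):
        # Drop punctuation that commonly trails a DOI in running text.
        doi = match.rstrip(".,;)")
        if doi not in seen:
            seen.append(doi)
    return seen


def choose_display(blocked, reported):
    """Pick what to show next to a citation, given the two counts for its DOI."""
    if blocked == 0 and reported == 0:
        return "report-button"
    return "closed-access-logo"
```

In this sketch, each DOI returned by `find_dois` would be looked up once, and the two counts returned for it passed to `choose_display`.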

Two questions to the OA Button team remain unresolved: whether they will perform 'entity resolution' on the several links to a paper that all share the same DOI, and whether they will change their model to allow anonymous reporting by users who are not logged in on Wikipedia, which is not currently supported.
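The kind of 'entity resolution' in question can be illustrated by grouping links under a normalised DOI. This is only a sketch of the general idea, not a description of how the OA Button team would implement it; the function names and the input shape (pairs of URL and DOI) are assumptions. It relies on the fact that DOI names are case-insensitive.

```python
from collections import defaultdict
from urllib.parse import unquote


def normalise_doi(doi):
    """Decode percent-escapes, trim whitespace and lower-case the DOI."""
    # DOI names are case-insensitive, so lower-casing makes variants comparable.
    return unquote(doi).strip().lower()


def group_links_by_doi(links):
    """Group (url, doi) pairs into a dict mapping each normalised DOI to its URLs."""
    groups = defaultdict(list)
    for url, doi in links:
        groups[normalise_doi(doi)].append(url)
    return dict(groups)
```

With such a grouping, reports filed against a publisher page and against a resolver link for the same paper would be counted together rather than separately.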

CrossRef
CrossRef, the minter and resolver of DOIs, is starting to provide metadata on citable material. However, the metadata of chief interest to us, licensing data, is not yet available at full scale. We would very much like to use this, and it remains in our bug tracker as Issue 12. We are also collaborating with them on tracking the use of DOIs across Wikimedia platforms in real time as well as overall.

ORCID
ORCID, the identifier for researchers and other writers, now has 1 million active ORCID IDs. Their organization hosts a Wikimedian in Residence. As our project inserts author information, it would be useful to also include an author's ORCID ID, but so far no large steps have been taken in that direction.

Icons
The Wikimedia logos have traditionally not all been available under an open license. This changed in October, when they were made available under CC BY-SA 3.0 (they continue to be trademarked). In order to provide the attribution that the license requires, any instance of such a logo on a Wikipedia page thus has to link back to that logo's file page on Wikimedia Commons. In order to use icons to link directly to the resource whose availability we want to signal (e.g. the Wikisource copy of the article, or the Wikimedia Commons category hosting the associated media), CC0 or public domain images would be required.

A new WMF iconography has been worked on that would meet this requirement, but it has not yet been officially released.

Outreach
We created a video promoting and explaining our project that was uploaded to Vimeo and Wikimedia Commons. Owing to its being linked to from the EFF blog and the Wikimedia blog, it received more than 2,000 views during Open Access Week alone.

A feature of reCitationBot is that it sends out a tweet from a dedicated account each time it is used. Since the tool was not of production quality until the end of the project, we never had time to promote the bot fully. Still, the work has been done to allow people to track our work through Twitter.

We gave at least four talks about the OA signalling project at various conferences, from CSVconf to Wikimania 2014. Our talks went over well and were covered by the respective official and unofficial social media. Wikimania was a particularly fruitful conference for us because we were able to hook into the Wikisource meeting there. Talking to Wikisourcerors in person allowed us to overcome roadblocks that we had encountered when interacting only online (see above).

Travel

 * JATS-Con
 * Wikimedia Hackathon Zurich
 * WikiCon US
 * OKFest
 * Wikimania
 * Science 2.0 conference
 * PLOS hackathon
 * MozFest (e.g. ORCID)
 * See also events section on project main page


 * Shared Statement and Community Principles on Expectations of Scholarly Standards on Attribution
 * http://river-valley.zeeba.tv/principles-for-attribution/
 * Attribution recommendations at COASP
 * commentary
 * Cite-o-Meter
 * GitHub org WPOA; citation hackathon at PLOS; hack4ac; possibly hackathon at PMC Europe
 * Open Access Reader
 * best uncited
 * all cited PMIDs
 * discussion
 * collaboration with CrossRef on use of identifiers
 * OA books/ BITS
 * Topic Pages: Wikisource, Wikidata and icons; collection
 * NISO and Jisc workgroups on OA signalling
 * TIB proposal
 * See also
 * Annotations
 * PeerLibrary
 * On keywords: the PLOS Thesaurus is now public: https://github.com/PLOS/plos-thesaurus.
 * mwparserfromhell
 * tool Linked Items, which returns a sorted, de-duplicated list of Wikidata items corresponding to wiki links on a Wikipedia page, or in any piece of wiki-text.
 * http://ukwebfocus.wordpress.com/2014/08/28/links-from-wikipedia-to-russell-group-university-repositories/
 * http://alpha.richcitations.org/
 * e.g. ref. 24 in http://alpha.richcitations.org/view/10.1371/journal.pmed.1001700 or ref. 21 in http://alpha.richcitations.org/view/10.1371/journal.pone.0094597
 * see also http://api.richcitations.org/
 * blog post http://blogs.plos.org/tech/rich-citations/
 * Citenet demo
 * http://notconfusing.com/haskell-und-grepl-data-hacking-wikimedia-projects-exampled-with-the-open-access-signalling-project/
 * http://notconfusing.com/should-i-do-my-phd-in-the-open/
 * Making Open Access articles much more visible, automatically
 * Could such OA signalling on Wikipedia help scholarly authors find an open-access journal for their publications? example
 * Digital rights statements at Europeana
 * Journal Openness Index
 * top 575 licenses in the DPLA corpus
 * scholarly papers citing Wikipedia
 * On the Shoulders of Giants: The Growing Impact of Older Articles
 * 1411.0275
 * second-order effects:
 * OA signalling on Wikisource etc.
 * OA signalling of data citations
 * wikiversity:In the Lands of the Romanovs: An Annotated Bibliography of First-hand English-language Accounts of the Russian Empire (1613-1917)
 * discussion of possible switch in reference formatting

Finances

 * within budget