Wikipedia:OABOT

OAbot is a tool for editing articles so that academic citations link to open access publications (see the list of edits made).

Wikipedia links to hundreds of thousands of paywalled sources. Our community does not prohibit, or even discourage, citing paywalled sources, but at the same time there is no prohibition on surfacing open access (OA) versions right alongside those citations, as long as the link does not violate any copyrights. Indeed, a good citation includes as much information as possible so the reader can find (and use) the source in whatever way is easiest for them.

Workflow
The bot looks for CS1 citation templates, and for each of them:
 * parses the citation using wikiciteparser
 * queries the Dissemin API and Unpaywall with the extracted metadata
 * translates the identifier it returns into a citation parameter (arxiv, pmc, doi, or url as a fallback)
 * if that parameter is not already present in the template, and if no existing link is already free to read, adds it to the template.
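The steps above can be sketched roughly in Python. This is an illustrative sketch only: `add_oa_link` and `lookup` are hypothetical stand-ins for the bot's real internals, not its actual API.

```python
PREFERRED_PARAMS = ("arxiv", "pmc", "doi", "url")  # url is the fallback

def add_oa_link(template, lookup):
    """Add a free-to-read link to a parsed CS1 template, if one is found.

    template: dict mapping CS1 parameter names to values.
    lookup: stand-in for the Dissemin/Unpaywall query; returns a
    (parameter, value) pair or None.
    """
    if template.get("pmc"):          # an existing link is already free to read
        return template
    found = lookup(template)         # query with the extracted metadata
    if found:
        param, value = found
        if not template.get(param):  # only fill parameters that are empty
            template[param] = value
    return template
```

Note that the sketch never overwrites an existing parameter, matching the bot's rule of not erasing information from templates.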

Examples

 * Adding a free-to-read url:
 * Before:
 * After:


 * Adding a citeseerx:
 * Before:
 * After:


 * Signalling openness of an existing DOI with free:
 * Before:
 * After:

Code

 * https://github.com/dissemin/oabot

You are very welcome to contribute to the code (for instance via pull requests on GitHub) and to join the development team on wmflabs. You can request access to the Tools project.

If you want to make suggestions or report bugs, please add a task to the Phabricator project.

How does the bot work?
OABOT extracts the citations from an article and searches various indexes, APIs, and repositories for free-to-read versions of non-OA articles. OABOT can use the Dissemin backend to find these versions from sources such as CrossRef, BASE, DOAI and SHERPA/RoMEO. When it finds an alternative version, it checks whether it is already in the citation; if not, it adds a free-to-read link to the citation. This helps readers access the full text.

What kind of links does the bot add?
The bot adds a link with one of the following parameters:
 * arxiv
 * hdl
 * doi
 * pmc
 * citeseerx
 * url

The bot only uses url if none of the other, more specific parameters is known or applicable. The bot only adds a parameter if it is currently empty, so it never erases information from the templates.
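The preference for specific identifiers over a bare url, combined with the rule of only filling empty parameters, could be sketched as follows. The parameter ordering here is an assumption for illustration; only "url is the last resort" is stated in the text.

```python
# Most specific first; url is used only as a last resort.
PARAM_ORDER = ["arxiv", "hdl", "doi", "pmc", "citeseerx", "url"]

def pick_parameter(candidates, template):
    """Choose which parameter to add to a citation template.

    candidates: dict of parameter name -> proposed value (what the
    lookup services found).
    template: the existing CS1 parameters.
    Returns the first preferred (parameter, value) pair whose slot in
    the template is still empty, or None if nothing can be added.
    """
    for param in PARAM_ORDER:
        value = candidates.get(param)
        if value and not template.get(param):
            return param, value
    return None
```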

What kinds of links won't the bot add?

 * The bot won't add a link to a version not in CrossRef, BASE, DOAI, or SHERPA/RoMEO (it is not an open-web search for any version or PDF; it only draws from curated sources).
 * The bot won't add a link to an alternative version of a source that is already signaled as free to read (that is, if the source already appears as free to read when rendered).
 * The bot won't generally replace an existing url with a different one, or add a second url.
 * The bot will ignore sources in free form: it only considers citation templates.
 * The bot will try not to add redundant links, such as links to publisher versions already linked through a DOI.

What repositories is the bot querying and pulling from?
The bot currently queries:
 * Dissemin. Dissemin relies on several sources, including Zenodo, ORCID, and BASE, see https://dev.dissem.in/datasources.html
 * Unpaywall (formerly OAdoi). Unpaywall crawls the sources listed at https://api.unpaywall.org/data/sources.csv
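A query against Unpaywall can be sketched as below. The `/v2/{doi}` endpoint and the `is_oa` / `best_oa_location` fields come from Unpaywall's public REST API; the helper names are illustrative, not OAbot's actual code.

```python
import json
from urllib.request import urlopen

UNPAYWALL = "https://api.unpaywall.org/v2/{doi}?email={email}"

def fetch_record(doi, email):
    """Fetch the Unpaywall record for a DOI (requires network access).
    The DOI should be URL-safe; percent-encode it if needed."""
    with urlopen(UNPAYWALL.format(doi=doi, email=email)) as resp:
        return json.load(resp)

def best_oa_url(record):
    """Extract a free-to-read URL from an Unpaywall-style record,
    preferring a direct PDF link; returns None for closed-access works."""
    if not record.get("is_oa"):
        return None
    loc = record.get("best_oa_location") or {}
    return loc.get("url_for_pdf") or loc.get("url")
```

For example, `best_oa_url` applied to a record with `"is_oa": false` returns None, which is the case where the bot adds nothing.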

In the future we could add Internet Archive Scholar (or any others, like CORE, SHARE Notify, Handle.net, MLA CORE, CHORUS), once their indexes provide additional benefit and have a workable API.

What's the copyright status of the proposed links?
The bot adds links to gratis copies offered by repositories and publishers under a variety of licenses: some are not freely licensed or don't have a public license at all, for example bronze open access copies by publishers or some archival copies. Publishers and repositories obtain the right to do so in a variety of ways.

Our sources (listed above) only link reputable archives and open access repositories, typically run by libraries or research institutions, which are not known to violate copyright law. For example, European Union copyright law is more restrictive than that of the United States, yet a secondary publication right or other copyright limitation exists in various countries (including Belgium, France, Germany, the Netherlands, Slovenia and Bulgaria), allowing repositories to obtain and provide a license. Such jurisdictions also tend to host the bigger repositories (such as HAL).

However, mistakes are always possible. If you know or reasonably suspect a publisher or repository to have provided a work in error, do not link it.

Finally, publishers don't always endorse the existing laws of all countries, and may profess to have the right to prevent such archival efforts. You can learn the publisher's opinion from any copyright statements available at the DOI's location and from the SHERPA/RoMEO summary of each journal's policy.

For additional information see also:
 * Open access and copyright
 * A social networking site is not an open access repository

Why did the bot not add this identifier?
OABOT tries to perform the minimum changes required to make a citation open access.

The identifier you have in mind may not be known to provide an open access copy, or it may be one of many identifiers not currently supported. Alternatively, another identifier is present which already auto-links the title and guarantees the open access status of the work (most commonly it's PubMed Central).

Why did the bot remove a doi-access parameter?
The work is now considered closed access by Unpaywall, so we are no longer sure that the DOI actually provides a full-text PDF. This usually happens for bronze open access (gratis, non-libre) works, such as works temporarily made accessible at the height of the COVID-19 pandemic.

The status of works with a free Creative Commons license or hosted by an open access repository tends to be more durable.

How do I stop the bot from removing a link?
As discussed above, the bot tries to avoid touching citations which already clearly provide an open access copy.

The best way to ensure a citation keeps linking your preferred copy is to add a direct link to an archived PDF or an open access repository identifier. For example, if you provide a PubMed Central identifier, cite journal will keep linking the PMC copy, which is often a publisher-provided copy of the published version, even if the doi-access parameter changes.

A publisher-provided copy can be linked more permanently by adding the URL of an Internet Archive preserved version, which can often be found through https://fatcat.wiki search or identifier lookup (or even a Google Scholar search): see example edit. If no archived copy is available, but the publisher provides a Creative Commons licensed copy, you can manually download that and archive it on Zenodo (Dissemin can be used for this; if you upload directly to Zenodo, don't forget to use the publisher's DOI, otherwise Unpaywall won't match the copy), and link the Zenodo copy in the URL parameter.

Why did the bot remove a URL?
The URL may be redundant with an identifier parameter (for example the DOI) or may need to be removed in order to provide the best known open access copy.

Many existing URLs need to be removed in order to follow the recommendations for Convenience links and Access indicators for url-holding parameters. In hundreds of thousands of cases, a redundant and paywalled URL was added to cite journal due to a bug in VisualEditor/Citoid (T232771), not as a conscious choice by the person who added the citation.

In other cases, the URL may have changed, for example because an open repository changed URL structure (and we're unable to use handle.net identifiers for it) or because the canonical location changed (for example, a copy preserved by the Internet Archive may be reachable from multiple URLs under web.archive.org, archive.org or scholar.archive.org, as well as partnering libraries like biodiversitylibrary.org).

Why does the oabot tool make edits the bot doesn't?
The oabot tool allows users to perform edits which are not yet allowed for User:OAbot to run automatically, such as certain link removals or additions.

I am a publisher. How do I make sure OAbot recognizes my full texts?
You should make sure that
 * You comply with the Google Scholar guidelines for exposing your full texts. In particular, the landing page for articles that are free to read should contain the citation_pdf_url meta tag with a direct link to a PDF file.
 * Zotero is able to import metadata and the full text from any landing page. This should be straightforward if you comply with Google Scholar's guidelines. Otherwise, you can fix the Zotero translator yourself by submitting a pull request to Zotero.
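A publisher (or repository) can check whether a landing page exposes the expected Highwire Press-style meta tags with a few lines of Python. This is an illustrative sketch using only the standard library:

```python
from html.parser import HTMLParser

class CitationMetaParser(HTMLParser):
    """Collect citation_* <meta> tags (the kind Google Scholar's
    inclusion guidelines ask for) from a landing page."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            name = d.get("name") or ""
            if name.startswith("citation_"):
                self.meta[name] = d.get("content", "")

def pdf_link(html_text):
    """Return the direct PDF link advertised by the landing page, if any."""
    parser = CitationMetaParser()
    parser.feed(html_text)
    return parser.meta.get("citation_pdf_url")
```

If `pdf_link` returns None for a free-to-read article's landing page, indexers have no machine-readable way to find the PDF.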

In addition, it is useful if you make sure that
 * All your fully open-access journals are registered in DOAJ.
 * The CrossRef metadata includes the correct license for each article: it should be straightforward to tell whether the article is free to read simply by looking at this piece of information.

Once you comply with these guidelines, the bot should mark your DOIs as free to read in Wikipedia, with a green lock icon.

I run a repository. How do I make sure OAbot can add links to my repository?

 * Provide a valid OAI-PMH interface and make sure it is harvested by BASE.
 * Comply with the Google Scholar guidelines for exposing your full texts. In particular, the landing page for articles that are free to read should contain the citation_pdf_url meta tag with a direct link to a PDF file.
 * Zotero should be able to retrieve metadata and the full text from any landing page. This should be straightforward if you comply with Google Scholar's guidelines. Otherwise, you can fix the Zotero translator yourself by submitting a pull request to Zotero.

I am a researcher. How do I make sure OAbot finds full texts for my papers?
Make sure all your papers are deposited in a mature repository (that complies with the guidelines above) such as Zenodo. You can use http://dissem.in/ for that. Other large repositories such as PubMed Central, arXiv or HAL will work too. The repository should give free access to the full text (not just the abstract). Records with ongoing embargoes are not considered.

Full texts stored on personal homepages will generally not be considered.

How many links should the bot add?
The bot only adds one link, even if it finds multiple alternative versions. For example, if OABOT finds a preprint on arXiv, a post-print in a university repository, and a PDF on the author's website, it chooses only one, based on a ranking algorithm in Dissemin.
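Dissemin's actual ranking is not described here, but the idea of collapsing several candidate copies to a single best one can be sketched as a toy preference function. Everything in this snippet (the repository ranking table, the PDF preference) is an illustrative assumption, not Dissemin's algorithm:

```python
from urllib.parse import urlparse

# Toy ranking table: lower is better; unknown hosts rank last.
REPO_RANK = {"arxiv.org": 0, "ncbi.nlm.nih.gov": 0, "hal.science": 1}

def choose_one(urls):
    """Pick a single link from the candidate free-to-read URLs,
    preferring well-known repositories, then direct PDF links."""
    def key(url):
        host = urlparse(url).netloc.removeprefix("www.")
        return (REPO_RANK.get(host, 9), not url.endswith(".pdf"))
    return min(urls, key=key) if urls else None
```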

What does the citation look like?
When the URL parameter is changed, the citation doesn't have any additional text or graphical elements, just an additional link.

Can we signal the version type (preprint, postprint, published version)?
At the moment, no. For most repositories this metadata just doesn't exist or isn't well-curated.

How can the bot be localized/globalized to work on any wiki?
The bot can function on any wiki, but only if that wiki uses the CS1 citation templates in the same way.

Edge cases for future development
OABot will find situations where a url is already present which is not open, but the bot can locate a free-to-read version. In some cases we can add the secondary link as an identifier, but there are edge cases, needing consensus, where the bot behavior is undetermined:
 * 1) When the url matches an existing identifier:
 * Say we have 10.1004/1543 and http://doi.org/10.1004/1543. Can we overwrite url to put a free-to-read repository link there?
 * 2) When we can't match the url with an existing identifier but OABot finds a repository version:
 * For instance, if we find http://www.sciencedirect.com/science/article/pii/S1535610816303981, we won't overwrite url automatically, but we would like to add the free repository URL somewhere else. If the free URLs we want to add come from a few repositories, is it appropriate to create templates for these specific repositories and add them as id?
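Detecting the first edge case, a url that is merely a doi.org resolver link for an identifier the template already carries, is mechanical. A sketch (the helper name is illustrative):

```python
from urllib.parse import urlparse, unquote

def url_matches_doi(url, doi):
    """True if the url is just a doi.org (or dx.doi.org) resolver link
    for the same DOI, and is therefore redundant with the doi parameter."""
    parsed = urlparse(url)
    host = parsed.netloc.removeprefix("www.").removeprefix("dx.")
    if host != "doi.org":
        return False
    # DOIs are case-insensitive and may be percent-encoded in URLs.
    return unquote(parsed.path.lstrip("/")).lower() == doi.lower()
```

With the example above, `url_matches_doi("http://doi.org/10.1004/1543", "10.1004/1543")` is true, so the url slot could in principle be repurposed for a repository link.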

Next steps

 * Localize/Globalize bot on translatewiki.net
 * Integrate with Wikidata
 * Use a citation parser to format references without templates

Resources

 * Pywikibot framework documentation
 * wikiciteparser, a Python parser for citation templates based on mwparserfromhell
 * Wikicite – a way to record and track clusters of related articles

See also:
 * m:Research:Characterizing Wikipedia Citation Usage
 * Wikipedia:OABOT – statistics of this page

People

 * Ocaasi (WMF), Jake Orlowitz, founder of The Wikipedia Library
 * Pintoch (talk), from the Dissemin project, main developer
 * symac, coordinator for French speaking TWL and pywikibot-owner
 * Andrew Su
 * James Webber
 * a3nm
 * Sckott
 * Christian Pietsch, responsible for APIs at BASE (search engine)