User:GreenC/testcases/autourl

Document describing auto-generated URLs, problems they cause for bots and suggested solutions.

For example, given this template source (recoverable from the urlencoded API call shown further below):

    {{cite journal |title=The Discodermia calyx Toxin Calyculin A |last1=Edelson |first1=Jessica R. |last2=Brautigan |first2=David L. |date=24 January 2011 |journal=Toxins |volume=3 |issue=1 |pages=105–119 |doi=10.3390/toxins3010105 |doi-access=free |pmid=22069692 |pmc=3210456}}

Renders as:

    Edelson, Jessica R.; Brautigan, David L. (24 January 2011). "The Discodermia calyx Toxin Calyculin A". Toxins. 3 (1): 105–119. doi:10.3390/toxins3010105. PMC 3210456. PMID 22069692.
Note the title is linked to https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3210456 even though the URL is not in the template source, i.e. it is an auto-generated URL. Auto-generated URLs are generated by certain templates (such as {{cite journal}}) when certain conditions are met. For example, in this case, the url field is otherwise empty and a doi is specified along with doi-access=free. There are other conditions, and the conditions may be in flux as community consensus changes. Additionally, the conditions may differ on each wiki language site, as local communities have control over how templates work on their wiki.

Generally, auto-generated URLs should be respected by bots and not overwritten by the inclusion of a url, since a bot cannot know which URL is better. Thus bots should detect the existence of an auto-generated URL before adding a hard-coded URL.

There are three possible ways to detect an auto-generated URL:

 * 1. Program the bot to match the template behavior (conditions) described above, e.g. if there is no url, and there is a doi along with doi-access=free.
 * 2. Web-scrape the HTML of the Wikipedia page where the template is located and look at the HTML to see if an auto-generated URL was rendered.
 * 3. Use the MediaWiki API "parse" endpoint, convert the template into HTML and see if the title field has a URL attached.

The first is difficult and error-prone, as conditions may change at any time without documentation, and each of the 300+ language sites may have different conditions. The second is slow to load and hard to parse. The third is the most universal and stable, though the parsing is a little messy.

The MediaWiki API "parse" command for this template is: https://en.wikipedia.org/w/api.php?action=parse&text=%7B%7Bcite%20journal%20%7Ctitle%3DThe%20Discodermia%20calyx%20Toxin%20Calyculin%20A%20%7Clast1%3DEdelson%20%7Cfirst1%3DJessica%20R.%20%7Clast2%3DBrautigan%20%7Cfirst2%3DDavid%20L.%20%7Cdate%3D24%20January%202011%20%7Cjournal%3DToxins%20%7Cvolume%3D3%20%7Cissue%3D1%20%7Cpages%3D105%E2%80%93119%20%7Cdoi%3D10.3390%2Ftoxins3010105%20%7Cdoi-access%3Dfree%20%7Cpmid%3D22069692%20%7Cpmc%3D3210456%7D%7D&contentmodel=wikitext

The text= parameter is a urlencoded copy of the template with any newlines removed. In practice one would also add format=json to get a pure JSON result (the above is an HTML rendering of the JSON for visual debugging purposes). The API can be run on other wiki sites by changing the domain, for example tr.wikipedia.org for Turkish.
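As a sketch of the request construction in Python using only the standard library (the function name is hypothetical; the endpoint and parameters are the ones shown above):

```python
from urllib.parse import urlencode

def build_parse_url(template: str, domain: str = "en.wikipedia.org") -> str:
    """Build a MediaWiki API 'parse' request URL for a template.

    The template wikitext goes in the text= parameter (urlencoded, with
    newlines removed) and format=json requests a pure JSON result.
    """
    text = template.replace("\n", " ").strip()
    params = {
        "action": "parse",
        "text": text,
        "contentmodel": "wikitext",
        "format": "json",
    }
    return "https://" + domain + "/w/api.php?" + urlencode(params)

# Hypothetical usage with a shortened form of the example template
url = build_parse_url(
    "{{cite journal |title=The Discodermia calyx Toxin Calyculin A "
    "|doi=10.3390/toxins3010105 |doi-access=free |pmc=3210456}}"
)
```

Changing the domain argument (e.g. to tr.wikipedia.org) retargets the same request at another language wiki.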

The JSON can be parsed to see if it contains an auto-generated URL.

Parsing is a two-step process.
 * 1) Extract the title portion of the urlencoded template text and convert it to "The Discodermia calyx Toxin Calyculin A", i.e. the title text. This string can be identified as beginning with "title%3D" and ending at "%20%7C" (the encoded " |" field separator). Remove the leading "title%3D" and trailing "%20%7C". It is urlencoded so next urldecode it. Finally HTML encode any "&" to "&amp;". Do not HTML encode the whole string, just that character, though there may be others that need it (todo).
 * 2) Search whether there is an <a href=...> link associated with the title string, for example: <a rel="nofollow" class="external text" href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3210456">The Discodermia calyx Toxin Calyculin A</a>. If so, it is known the title is already linked by an auto-generated URL.
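The two steps above might be sketched as follows in Python. The helper names are hypothetical, and the regex in step 2 is an assumption about the shape of the parser's HTML output, not a guaranteed format:

```python
import re
from urllib.parse import unquote

def extract_title(encoded_template: str) -> str:
    """Step 1: pull the title text out of the urlencoded template.

    The title begins after 'title%3D' and ends at the next encoded
    ' |' field separator ('%20%7C'). Only '&' is HTML-encoded, since
    the parser's HTML output renders it as '&amp;'.
    """
    start = encoded_template.index("title%3D") + len("title%3D")
    end = encoded_template.index("%20%7C", start)
    return unquote(encoded_template[start:end]).replace("&", "&amp;")

def title_is_linked(html: str, title: str) -> bool:
    """Step 2: check whether the title appears as <a ... href=...>title</a>."""
    pattern = r"<a\b[^>]*\bhref=[^>]*>" + re.escape(title) + r"</a>"
    return re.search(pattern, html) is not None

# Fragment of the urlencoded template from the API call above
encoded = "%7Ctitle%3DThe%20Discodermia%20calyx%20Toxin%20Calyculin%20A%20%7Clast1%3DEdelson"
title = extract_title(encoded)  # "The Discodermia calyx Toxin Calyculin A"

# Fragment of the rendered HTML returned by the parse API
html = ('<a rel="nofollow" class="external text" '
        'href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3210456">'
        'The Discodermia calyx Toxin Calyculin A</a>')
linked = title_is_linked(html, title)  # True
```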

Because API requests can slow a bot, this check can be done late in the process, after other checks have passed.

This method should work universally, in any language or wiki site, and remain stable regardless of changes to conditions that auto-generate a URL.