Wikipedia:Bots/Requests for approval/DOI bot 2


 * The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Approved.

DOI bot 2
Operator: Smith609  Talk 

Automatic or Manually Assisted: Automatic

Programming Language(s): PHP w/basicbot

Function Summary: Add missing parameters to citations from CrossRef database, and tidy citations

Edit period(s) (e.g. Continuous, daily, one time run): Complete run through of all pages transcluding cite journal, with "on demand" editing of individual pages where applicable

Already has a bot flag (Y/N): Yes

Function Details: This bot originated as a tool to add DOIs to citations. I later discovered that the CrossRef database also contains citation details (title, author, year, journal etc), so tweaked the bot to add these parameters too where they were missing. If the CrossRef database contradicts the information in the article, the bot will stick with the data already in Wikipedia, and assume the error to be with CrossRef. Consensus appears to be that specifying a URL parameter is also useful; the bot can specify the URL that the DOI redirects to and in some cases make an intelligent guess as to its nature (abstract, fulltext etc) which can be recorded in the "format" parameter.
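As a rough illustration of the "stick with the data already in Wikipedia" rule, the sketch below fills only parameters that are absent or blank and never overwrites a value an editor supplied. It is not the bot's actual code (which is PHP); the function name `fill_missing`, the dictionary representation of a citation, and the placeholder values are all assumptions made for this example.

```python
def fill_missing(citation: dict[str, str], crossref: dict[str, str]) -> dict[str, str]:
    """Return a copy of `citation` with absent or blank parameters filled from `crossref`.

    Hypothetical helper: existing non-blank values always win, even when the
    database disagrees with them.
    """
    merged = dict(citation)
    for name, value in crossref.items():
        already_set = merged.get(name, "").strip() != ""
        if not already_set and value.strip():
            merged[name] = value.strip()
    return merged


# Placeholder data, not a real citation.
existing = {"title": "Example title", "year": "1999", "journal": ""}
looked_up = {
    "title": "Example Title (database spelling)",
    "year": "2000",
    "journal": "Example Journal",
    "doi": "10.1000/example",
}
result = fill_missing(existing, looked_up)
```

Here `title` and `year` keep the article's values despite the disagreement, while the blank `journal` and the missing `doi` are filled in.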

There have also been requests for the bot to correct common mistakes, such as replacing "id = PMID 123" with "pmid=123", percent-encoding parameters within DOIs so they link correctly, and replacing erroneously capitalised parameters (example: "Journal=Science" with "journal=Science"). Since these seemed uncontroversial I implemented these as I went, but my sense is that an official approval would placate some of Wikipedia's administrators.
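The three tidying steps just described could be sketched roughly as follows. This is an illustration only, not the bot's PHP source: the name `tidy_params`, the list-of-pairs representation, and the choice of which characters are left unencoded in a DOI are assumptions, and the sketch assumes the DOI has not already been percent-encoded.

```python
import re
from urllib.parse import quote

# Matches values such as "PMID 123" or "pmid: 123" (assumed input shapes).
PMID_VALUE = re.compile(r"PMID\s*:?\s*(\d+)", re.IGNORECASE)


def tidy_params(params: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Lower-case parameter names, move 'id = PMID n' to 'pmid', percent-encode DOIs."""
    tidied = []
    for name, value in params:
        name = name.strip().lower()
        value = value.strip()
        pmid = PMID_VALUE.fullmatch(value)
        if name == "id" and pmid:
            name, value = "pmid", pmid.group(1)
        elif name == "doi":
            # Assumption: keep "/", ":", "(" and ")" readable, encode the rest.
            value = quote(value, safe="/:()")
        tidied.append((name, value))
    return tidied


tidied = tidy_params([
    ("Journal", "Science"),
    ("id", "PMID 123"),
    ("doi", "10.1000/a<b>c"),
])
```

With this input, `Journal` becomes `journal`, the `id` entry becomes `pmid` with value `123`, and the angle brackets in the placeholder DOI are encoded as `%3C` and `%3E`.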

Because of the uneconomical way the bot's code has developed, it is actually simpler for it to clean up citations as it goes, removing duplicated parameters (some of which have been created by the previous run of DOI bot). In cases where there is more than one instance of a parameter, the bot will remove:
 * 1) If one or more are empty, the empty one
 * 2) Any identical duplicates
 * 3) If two non-identical values are present, it will use the one which appears later in the citation: this is the one which is used by the parser software when rendering the template.
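The three rules above can be sketched in a few lines. This is an illustrative reading of the rules rather than the bot's own code; `dedupe_params` is a hypothetical name, and parameter names are assumed to have been normalised (for example lower-cased) beforehand.

```python
def dedupe_params(params: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep one value per parameter name.

    An empty value never replaces a filled one, identical duplicates collapse,
    and where two different non-empty values exist the later one wins, which
    mirrors how the template parser resolves repeated parameters.
    """
    chosen: dict[str, str] = {}
    for name, value in params:
        value = value.strip()
        if value or name not in chosen:
            chosen[name] = value
    # dicts preserve insertion order, so each name keeps its first position.
    return list(chosen.items())


deduped = dedupe_params([
    ("title", "A"),
    ("year", ""),
    ("year", "2008"),
    ("title", "A"),
    ("journal", "X"),
    ("journal", "Y"),
])
```

In this example the empty `year` is dropped in favour of `2008`, the repeated `title` collapses to one entry, and `journal` takes the later value `Y`.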

I should make a final note to anyone reviewing the bot's edits that I recently re-wrote a large portion of the bot's code to make it more efficient, and to respond to comments on my user page; this was significantly buggy, and I unwittingly left the bot running on more test pages than I intended. This accounts for the errant nature of edits after those with edit summary "Testing new better-mannered editing." An embarrassing mistake, for which I apologise - all of these edits were reverted as quickly as I could.

Discussion

 * Note: Just to mention that most controversy about DOI bot seems to boil down to personal opinions about the rendering of cite journal. To avoid getting sidelined, please remember that how a DOI or URL is displayed in a citation is of no consequence to this bot; the current appearance of the cite journal template, as well as the parameters which are deemed worthy of inclusion, represents the current consensus - and at the end of the day is of limited importance.  Discussion here should be restricted to potential problems that fulfilment of the tasks listed above may cause. Smith609  Talk  09:50, 6 May 2008 (UTC)
 * Can you run a test on a sandbox page and post the diff here? I think this would give a clear picture of what exactly this bot does to cite tags. --Lemmey talk 16:24, 6 May 2008 (UTC)


 * The bot's not quite functional at the second - I've not quite finished patching a couple of bugs, and am going to wait until I have an idea of how much of this request will be approved (and I have time). The bulk of its edits will resemble this one: ; minor tidying happened here:  and addition of a missing parameter can be seen here: . Smith609  Talk  17:53, 6 May 2008 (UTC)
 * P.S. here's an edit which demonstrates its full potential.
 * PPS: Now the bot's running in "sandbox only" mode, here's an example of its current activity. Smith609  Talk  18:48, 10 May 2008 (UTC)


 * I am a bit confused by the current function details listed - it says "The bot will never edit a manually added parameter." then it talks about fixing common mistakes, and removing duplicate parameters, both of which appear to be editing parameters. Some clarification would probably help.  (If it will edit existing parameters, specifying what sorts of edits it will make.)  Or is there some way it can distinguish manually added parameters from those added by itself or some other bot, and saying it will only correct the actions of bots?
 * Removing duplicates sounds like a good idea. Might be nice to clarify how you define a duplicate (e.g. lexically identical).
 * Also, removing tags with empty parameters if the same tag with filled in parameters is also present might be handy. Zodon (talk) 01:37, 7 May 2008 (UTC)
 * Sorry, I'm dreadful at being ambiguous/self-contradictory! I've re-worded the description appropriately. Smith609  Talk  09:51, 7 May 2008 (UTC)

Adding URLs to nonfree articles?
The description is a bit vague, so a complete example would help. One question: the usual style in articles I edit is that url= is reserved for articles where the entire text is freely readable, and that url= is not used for articles where just the abstract is readable (for that, you can just live with the DOI or PMID or whatever). Will the bot support this convention? That is, on such articles will it refuse to add URLs to articles that aren't entirely readable? Eubulides (talk) 09:58, 7 May 2008 (UTC)


 * I envision this being a possible bone of contention. I envision the bot providing a link where only an abstract is visible, but marking the URL as "abstract" or "subscription required" (using the "format" parameter).  The rationale for this is that casual readers may not understand that a DOI or PMID provides a link to the article, and that a title link is intuitive to follow.  The bot can't really tell whether editors have only chosen to provide URLs to free texts, you see.


 * The bot will use two approaches to determining the nature of the page; firstly, if the page returns a "403 Forbidden" header, it will log it as "subscription required"; secondly, if the url of the page contains the string "abstract" or "/abs" it will take it to be an abstract. If it receives a "200 OK" header and there is a "/fulltext/" in the URL, I think that's sufficient to specify "free full text" in the format. In the absence of these clues, the bot will leave the "format" parameter blank.  This is very much open to discussion, though!  Does it sound unproblematic to you?


 * Smith609  Talk  10:45, 7 May 2008 (UTC)
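The heuristic proposed in the comment above might look something like the sketch below. It is only an interpretation: `guess_format` is a hypothetical name, the order in which the checks are tried is a guess, and treating HTTP status 200 as the "page loaded successfully" case is an assumption.

```python
def guess_format(status_code: int, url: str) -> str:
    """Guess a value for the "format" parameter from the response status and URL.

    Returns an empty string when there are no clues, meaning the parameter
    would be left blank.
    """
    lowered = url.lower()
    if status_code == 403:
        return "subscription required"
    if "abstract" in lowered or "/abs" in lowered:
        return "abstract"
    # Assumption: 200 is the success status the proposal has in mind.
    if status_code == 200 and "/fulltext/" in lowered:
        return "free full text"
    return ""


# example.org URLs are placeholders, not real publisher links.
paywalled = guess_format(403, "http://example.org/article/1")
abstract = guess_format(200, "http://example.org/abs/1")
fulltext = guess_format(200, "http://example.org/fulltext/1")
unknown = guess_format(200, "http://example.org/article/1")
```

A URL-substring test like this is cheap but easy to fool, which is presumably why the proposal leaves the parameter blank whenever none of the clues are present.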


 * The style I prefer, which is used in several articles, is that url= is used only for citations to articles that are freely available (not just abstracts, but the whole thing), and that citations to non-free articles must content themselves with DOIs or PMIDs. This is a very common style: it's suggested in the documentation for Template:Cite journal. For such articles, if the DOI bot finds a citation with |doi= but without |url=, it should not add a URL unless the URL is to a source that is freely readable (article body and all).
 * A casual user who sees something like "" won't have any trouble figuring out that the blue links in the DOI and PMID provide information about the article when clicked upon.
 * Eubulides (talk) 23:54, 7 May 2008 (UTC)
 * Re your latter point, it's easy to say that speaking as someone who is used to reading journals etc, but comments by LouScheffer on my talk page (see section "DOI bot removing existing URLs?") suggest that this perhaps isn't always the case. In the majority of articles I edit (which tend to be scientific rather than medical), the convention seems to be to provide a link, whatever - but then I guess that DOIs are rarely specified. I guess the crux of the matter is whether the title being linked is a genuine help to users, which was the sense I got from discussions on my talk page - I guess each of us has our own entrenched opinion that we're unlikely to change, so it would be helpful to get some views from the wider community! Smith609  Talk  07:56, 8 May 2008 (UTC)
 * I quite agree that articles use different styles. Even within medical articles there are different styles; for example, Tourette syndrome is a featured medical article that does not use Template:Cite journal. The DOI bot should work fairly well with all the major styles used in Wikipedia. For Tourette syndrome that's trivial. Asperger syndrome uses the style I mentioned, so on that article the DOI bot shouldn't add URLs to citations that lack them, unless it knows the referenced papers are freely readable. For the style of articles you tend to edit, the heuristics might well differ and the DOI bot could be more aggressive in adding URLs. Eubulides (talk) 19:50, 8 May 2008 (UTC)

Proposal from AN
This is the proposal, somewhat amplified regarding editing DOIs, that I posted in AN. I believe it was supported in principle by a consensus there. (I removed the point about rendering of the citation, which was correctly noted as beyond the scope of the bot.) The first limitation is to keep the bot from trying to edit citations which do not involve DOIs. Whether that was a function of a code problem on May 4, or people trying to misuse it, the bot code should refuse to edit any citation that is not to an actual journal whose publisher has included it in the DOI system, even if an editor used the cite journal template. The second limitation was discussed at length on AN. The third, fourth, and fifth items are the only things the bot should be doing. And the last limitation enforces accountability of editors, since if a manually-invoked edit is attributed to the bot, and not to an editor, there's no way to tell who made it. (Suggestions for implementing that include making a user-script version of the bot, or having the bot prepare a pre-filled edit box that the user could review and click, as some other tools do.) --MCB (talk) 21:50, 7 May 2008 (UTC)
 * 1) The bot may only edit citations where the reference is to a publication included in the DOI system or referenced in PMID.
 * 2) The bot must not remove or alter an existing URL.
 * 3) The bot may add a DOI or PMID reference to a citation.
 * 4) The bot may fix a syntactically broken DOI or PMID reference.
 * 5) The bot may update an erroneous DOI or PMID reference.
 * 6) If the bot is used as a tool in "manual mode", the resulting edit must be attributed to the user who invoked the bot.


 * The comments against Point 1 on the admin noticeboard received no reply; the point provides an unwarranted restriction on the bot. Please see my note above about the edits of May 4.
 * Point 6 (point 5 on WP:AN) appears to stem from a misunderstanding about how the bot works, and is completely unnecessary. As far as I can tell, it makes absolutely no difference whether a page is edited as a result of the bot adding it to the job queue, or a user adding it to the job queue; the same edit results, and is made by the bot.  By your logic, every edit the bot makes to pages I've added to its queue - i.e. all 45,000 pages transcluding "Cite Journal" - should be attributed to my account, inflating my edit count, and flooding the watchlists of people who ignore bot edits.
 * Finally, your statement that "3, 4 and 5 are all the bot should be doing" is not defended, in light of the proposal listed here.
 * Smith609  Talk  09:33, 8 May 2008 (UTC)

NB. Since the "doilabel" parameter has been rendered obsolete by a recent change to cite journal, it would make sense for DOI bot to remove these parameters where they appear. Smith609  Talk  20:12, 9 May 2008 (UTC)
 * I'm afraid I don't understand your objection to Point 1. Under what circumstances would the bot edit a citation where there is no journal that is included in DOI/PMID, and why?


 * I am assuming that you agree to Point 2.


 * As for Point 6, the idea is that if an automated tool is invoked by a user, the edit belongs to the user, and the accountability for the edit in terms of choice, wisdom, consistency, and compliance with policy belongs to the user. (Obviously, programming the bot by the bot owner -- i.e., telling it to process everything with cite journal) -- is not what is contemplated by this). But if a user chooses a page and a particular citation or set of citations, the responsibility is the user's, even though the work is done by the bot. I'm not sure why you are so opposed to that. It's how other user tools work.


 * I am also very disturbed by the comment on your user talk page, "Hi, [the bot has] been disabled for bureaucratic reasons. I hadn't realised it was still useful in its crippled state! It's back running again now." First, that you seem to dismiss the policy issues discussed on AN as "bureaucratic reasons", and more importantly, that you are still running the bot even though it has not been unblocked (apparently it has some function that is not disabled by a user block), and seem to be pleased about that fact. I'd like to ask you to take immediate steps to fully disable it pending the outcome of this; otherwise I'd see it as an attempt to evade the block. Thanks, MCB (talk) 22:12, 9 May 2008 (UTC)


 * Some day I might manage to communicate without confusing people - sorry!! Firstly, to reassure you that the bot is still completely unable to edit pages - LouScheffer was running the bot in order to see the DOIs it found, and I'm glad that he's found this useful. It absolutely  won't edit the wiki (with the exception of userspace, which I will reactivate soon) until the full bureaucratic process of bot approval has been undergone, and until I am satisfied that the bot is not going to be causing upset by its actions.


 * An example of when the bot may edit a citation that doesn't have a DOI is to correct capitalisation errors, for example replacing "Journal=Science" with "journal=Science" so the citation displays correctly.
 * Implementing point six would be beyond my programming skills. I also feel it would be misleading - once a page has been selected, a user does not have any control over the bot's edits, nor do they check them first - unlike editing tools, which (as I understand it) make suggestions of edits, which the human editor decides whether to implement or not.  I'd be surprised if you could provide an example of where it would be a genuine advantage to know which editor had informed the bot that a page could benefit from its attention.  To provide a disadvantage (the waste of my time aside), what would happen if the bot needed blocking again? It would still be able to make edits in the guise of other people's accounts.
 * Cheers, Smith609  Talk  23:52, 9 May 2008 (UTC)
 * Sorry to misunderstand you! I should not have jumped to conclusions, and I apologize for making assumptions based on that. About editing citations, I apparently overlooked the feature of what I think you'd called "tidying citations" which should be harmless. What I guess I'm getting at is what happened with the edits on May 4 where the bot edited citations that were web sites or newspaper articles where an editor had erroneously used cite journal -- leaving aside the code errors, which I assume are or will be fixed, I'm not sure what was supposed to be done to those citations.
 * About the edit attributions, I'm not sure what can be done. I certainly mean no criticism of your programming -- it's undoubtedly lightyears ahead of my abilities -- but I do know that there are user scripts that propose edits that the user ratifies, or submits them directly (alas, I don't know how to do that). Alternatively, does the bot know who invoked it? Because if so, it could simply note that in the edit summary, which would have the same ultimate effect, essentially an "audit trail". But I seem to be the only person worried about this, so perhaps it is not that important overall given the self-limitations of the bot. Best, MCB (talk) 06:42, 10 May 2008 (UTC)
 * It would be possible to ask users to identify themselves when using the form to initiate the bot, but without requesting their password (which is A Bad Thing To Do for a number of reasons), it couldn't be verified. If you think this would be useful, I'd be happy to implement that.  The edits of May 4th, as I've mentioned, were a result of me misapplying a patch, and the bot running when I thought it had turned itself off.  The "fixed" bot had a successful tinker with my sandbox; for a representative edit of some previously troublesome citations that LouScheffer has kindly collected, see this test run; and here's a  "before" and "after" of some citation tidying.


 * If that's addressed all your qualms, I'm now almost ready to start testing the bot on "real" pages: if a member of the Bot Approval Group were able to check over the comments above and ensure that there's nothing potentially harmful or unwanted that I've overlooked, that would be most appreciated! Smith609  Talk  07:54, 10 May 2008 (UTC)
 * BAGAssistanceNeeded


 * βcommand 2 20:06, 11 May 2008 (UTC)

Trial
Hi, I've just about reached the 50 edit ceiling; I've been mainly editing cite doi subpages while I get fine details tweaked (and because it's saving me time with citations I'm adding!). This has been very useful as I'm now comfortable it's not going to cause extensive damage to established pages, but I should like to try the bot on some articles and established citations before I am confident letting it loose on its own. Is it possible that the trial could be extended to, say, 200 edits, to give the bot a chance to encounter a wider range of "non-standard" citations? Thanks, Smith609  Talk  07:57, 13 May 2008 (UTC)
 * Sure - Continue testing as needed. :)  krimpet  ✽  06:35, 16 May 2008 (UTC)
 * Just for reference, would it be possible for you to post a list that summarizes exactly what the bot does, functionally? Presumably each section of code does something specific, so I'm hoping that's something relatively simple to do. Thanks, MCB (talk) 18:13, 13 May 2008 (UTC)
 * Hi, a comprehensive (I think) list is now available at User:DOI bot under "function details".
 * On reflection, there are a couple of things that may be worth noting here:
 * The bot replaces "url=http://dx.doi.org/#" with "doi=#" - I think this was the one URL manipulation deemed okay.
 * The bot and the template cite doi have been evolving in parallel. The current solution may be "a very bad idea", bad enough to seriously affect performance, so on the advice of User:Betacommand, I have been thinking of alternative ways to implement it. Smith609  Talk  07:33, 14 May 2008 (UTC)
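For the "url=http://dx.doi.org/#" to "doi=#" replacement noted above, a minimal sketch might be the following. It is not taken from the bot's source: `url_to_doi` is a hypothetical name, and the decision to leave the citation alone when a `doi` value is already present is an assumption made for the example.

```python
import re
from urllib.parse import unquote

# A dx.doi.org link whose path looks like a DOI (DOIs start with "10.").
DX_DOI = re.compile(r"^https?://dx\.doi\.org/(10\..+)$", re.IGNORECASE)


def url_to_doi(params: dict[str, str]) -> dict[str, str]:
    """Replace a url pointing at dx.doi.org with the equivalent doi parameter."""
    result = dict(params)
    match = DX_DOI.match(result.get("url", "").strip())
    # Assumption: do nothing if the citation already carries a doi value.
    if match and not result.get("doi", "").strip():
        del result["url"]
        result["doi"] = unquote(match.group(1))
    return result


# Placeholder citation, not a real reference.
converted = url_to_doi({"title": "T", "url": "http://dx.doi.org/10.1000/example"})
untouched = url_to_doi({"title": "T", "url": "http://example.org/paper.pdf"})
```

Only links to the resolver are rewritten, so any other URL an editor has supplied is left exactly as it was.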

Looks great. I definitely endorse the URL replacement in #1 since it makes Wikipedia less dependent on doi.org specifically, which was one of my original concerns. And several sections above, when I wrote something like "may only edit citations where the journal is in the DOI or PMID system", I think that's dealt with by limiting the bot to use of CrossRef; if something is in CrossRef I think we can assume it is a legitimate part of the DOI system. Great work and thanks for being patient with me and the discussion here. Best, MCB (talk) 19:00, 14 May 2008 (UTC)

Trial results
Hi, the bot has made somewhere in the order of 200 edits since it was fixed. I'm satisfied that operation is now bug-free enough to consider the trial a success; if there's anything I've missed (i.e. not marked as fixed at User:DOI bot/bugs) do please let me know! Thanks, Smith609  Talk  11:27, 23 May 2008 (UTC)


 * BAGAssistanceNeeded: Would someone mind taking a look at this bot and giving it the official stamp, please? I'd like to roll it out into full operation asap. Thanks, Smith609  Talk  13:43, 26 May 2008 (UTC)
 * (though I note one issue at User:DOI bot/bugs marked as under investigation; please take it slow until that's fixed.) dihydrogen monoxide (H2O) 10:35, 27 May 2008 (UTC)


 * The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.

Great, thanks a lot! Smith609  Talk  10:51, 27 May 2008 (UTC)