Wikipedia:Bots/Requests for approval/Full-date unlinking bot


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.

Full-date unlinking bot
Operator: Harej

Automatic or Manually assisted: Automatic

Programming language(s): PHP

Source code available: User:Full-date unlinking bot/code

Function overview: Removes links from dates.

Edit period(s): Continuous

Estimated number of pages affected: 650,000+, exclusively in the article space.

Exclusion compliant (Y/N): Yes

Already has a bot flag (Y/N): No

Function details: On Full-date unlinking bot, a consensus was achieved that full dates (a month, a day, and a year) should not be linked to in articles unless they are germane to the articles themselves (for example, in articles on dates themselves). In order to have the most support, the bot will operate on a conservative basis and will only unlink full dates. The details of operation and exceptions are available on User:Full-date unlinking bot.

Discussion

 * Anomie will no doubt want to have a look at the code. I'll ping him. - Jarry1250 [ In the UK? Sign the petition! ] 10:25, 23 August 2009 (UTC)
 * Surely there exists some estimate of the number of pages which need changing? 6 - 7epm would be right for a bot of this kind, I would think. - Jarry1250 [ In the UK? Sign the petition! ] 10:28, 23 August 2009 (UTC)
 * To compile a reasonable estimate would be resource intensive. Thankfully, the bot by design will stick to the main space. And I don't see why the bot would go faster than 6-7 edits per minute, based on my own experience with PHP bots. @harej 10:38, 23 August 2009 (UTC)
 * Speaking of resource intensive you might want to put a few sleeps in the code so you don't kill the servers :) -- Chris 12:34, 23 August 2009 (UTC)
 * You have a point. While it won't edit more frequently than the usual bot, that doesn't mean it's resting between edits. I will work that in. @harej 17:51, 23 August 2009 (UTC)
 * A scan of the 20090822 database dump for the most common formats shows that 650,000+ articles contain full-date links. A throttle of 6-7 edits per minute would cover about 9,000 to 10,000 articles per day, assuming uninterrupted operation.  -- Tcncv (talk) 02:18, 24 August 2009 (UTC)
 * I don't think the code is intended to be ready for approval yet (correct me if I'm wrong here harej), but BAG comments would be good. It will edit a significant portion of all articles, I guess a few hundred thousand. --Apoc2400 (talk) 17:36, 23 August 2009 (UTC)
 * It's ready to be approved, and it's not ready to be approved. At this point, I have the basic framework down, and I am looking for input on what needs to be improved. (I know that bots compliance and handling piped links needs to be addressed). @harej 17:51, 23 August 2009 (UTC)
 * What sort of handling of piped links is needed? If the bot's purpose is to unlink dates that are linked for autoformatting purposes, you should be ignoring all piped links just as MediaWiki's autoformatting does. Anomie⚔ 18:44, 23 August 2009 (UTC)
 * User talk:Full-date unlinking bot points out different things people have done with regards to date links. @harej 18:54, 23 August 2009 (UTC)
 * But none of those will be auto-formatted. Wasn't autoformatting the point of the bot, with other things left for human attention? Anomie⚔ 18:57, 23 August 2009 (UTC)
 * Good point. I will remove the piped link support I added. @harej 19:17, 23 August 2009 (UTC)


 * The specification mentions an exclusion list in addition to bots/nobots. This doesn't seem to be present. Mr.Z-man 19:20, 23 August 2009 (UTC)
 * It is available here: User:Full-date unlinking bot. @harej 19:32, 23 August 2009 (UTC)
 * I'm referring to detail #6, "An exclusion list will contain the articles the bot will not edit. This list will contain the few article titles where a link to month-day and/or year within at least one triple date meets the relevance requirement in MOSNUM. (In these cases, it would be easier to edit the page manually in accordance with this at a later time.) Articles will be added to the list after manual review; there should be no indiscriminate mass additions of articles to the exclusion list. The list will be openly editable for one month before the bot starts running." Mr.Z-man 20:18, 23 August 2009 (UTC)
 * Ultimately what ended up happening was that rather than a full list of articles, we made a list of types of articles that the bot would exclude, as enumerated in the Exceptions list. @harej 20:27, 23 August 2009 (UTC)
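The throughput figures quoted by Tcncv above can be checked with simple arithmetic. The sketch below is in Python rather than the bot's PHP, purely for illustration; the function names are hypothetical, and it assumes a constant edit rate with no downtime.

```python
# Back-of-the-envelope check of the throttle figures from the discussion.
# Assumes a constant edit rate and uninterrupted operation.

MINUTES_PER_DAY = 24 * 60


def edits_per_day(edits_per_minute):
    """Edits made in one day at a constant per-minute rate."""
    return edits_per_minute * MINUTES_PER_DAY


def days_to_finish(total_pages, edits_per_minute):
    """Days needed to edit total_pages once each at the given rate."""
    return total_pages / edits_per_day(edits_per_minute)


if __name__ == "__main__":
    for epm in (6, 7):
        print(epm, edits_per_day(epm), round(days_to_finish(650_000, epm)))
```

At 6 to 7 edits per minute this gives 8,640 to 10,080 edits per day, in line with the "about 9,000 to 10,000" estimate, and a full pass over 650,000 articles would take on the order of two to two and a half months.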

Code review
Note that I'm not going to touch the consensus issues relating to this bot or any evaluation of its specification, I'm just going to see if the code seems to do what the specification at User:Full-date unlinking bot states. This review is based on this revision.
 * 1) It will fail to process pages linking to "January 1" and so on, because the code on line 140 will check "January1" (no space) instead.
 * 2) It doesn't check for API errors from $objwiki->whatlinkshere. Which means that, on an API error, it will try to process the page with a null/empty title (whether it will "succeed" or not I haven't determined) and skip all the actual pages linking to that month/day.
 * 3) It would be more efficient to pass "&blnamespace=0" as the $extra parameter to whatlinkshere.
 * 4) It would also be more efficient (and less likely to fail) to check the "ns" parameter in each returned page record from list=backlinks than to match a regex against the page title to detect namespaces. Of course, that would mean not using $objwiki->whatlinkshere.
 * I'd recommend doing both #3 and #4, just in case Domas decides to break blnamespace like he did cmnamespace.
 * Is there a substantial chance of that happening? @harej 04:27, 24 August 2009 (UTC)
 * I would hope not, but you never can really tell. Anomie⚔ 17:57, 24 August 2009 (UTC)
 * 5) Your namespace-matching regex is broken, it will only match talk namespaces. Which means the bot would edit any even-numbered namespace, not just the article namespace.
 * Resolved by getting rid of the namespace-matching regex altogether. @harej 04:27, 24 August 2009 (UTC)
 * 6) Your regex3 will match "1st", "3rd", and "4th"-"9th", but not "2nd".
 * 7) Your list of topics in regex3 and regex4 may or may not be comprehensive. It might be safer to just match all possible topics with ".+".
 * 8) Your regex3/regex4 will not match "intrinsically chronological articles" named like "1990s in X", or "List of 1994 Xs", or "List of 1990s Xs", or "List of 20th century Xs", or "List of 2nd millennium Xs", or "List of Xs in the 1990s", or "List of Xs in the 20th century", or "List of Xs in the 2nd millennium". There may be other patterns, those are just what I can think of off the top of my head.
 * As far as I know, I have fixed this. @harej 04:53, 24 August 2009 (UTC)
 * "List of Xs in 1990" is still missing, make "the" optional in regex4. Anomie⚔ 17:57, 24 August 2009 (UTC)
 * I made "the" optional, so now "List of Xs in 1990" will work. @harej 18:29, 24 August 2009 (UTC)
 * 9) For that matter, your regex3 won't even compile. It has unbalanced parentheses.
 * 10) checktoprocess doesn't check for errors from $objwiki->getpage either. Which could easily lead to the bot blanking pages.
 * 11) Putting a comment in the page to indicate that the bot processed it is The Wrong Way to do it. If someone reverts the bot they're going to be reverting your comment too, which means that it will completely fail in its purpose. You need to use a database of some sort (sqlite is easy if you don't already have one available), and I recommend storing the pageid rather than the title as pageid is not affected by moves. And even if no one reverts it, that means that hundreds of thousands of articles will have this useless comment in them for quite some time.
 * If the bot must confer with a database for each page to make sure it has not already been processed, how much will this add to the amount of time it takes to process a page? @harej 19:11, 23 August 2009 (UTC)
 * Negligible, really, especially if the database is on the same machine so network issues are minimized. Communicating with the MediaWiki API will take much more of your time, and even that will likely be dwarfed by the necessary sleeps to avoid blowing up the servers. Anomie⚔ 17:57, 24 August 2009 (UTC)
 * Come to think of it, I should rewrite all my scripts to interface directly with the database instead of with the API (though having it interface with the API makes it a lot more portable). Anyways, as I have said below, I am working to replace the comment-based system with a sqlite database. @harej 18:29, 24 August 2009 (UTC)
 * 12) $contents never makes its way into unlinker, as there are no "global" declarations.
 * As far as I know, I have fixed this. @harej 18:29, 24 August 2009 (UTC)
 * 13) You could theoretically skip "Sept" in the date-matching regular expressions, enwiki uses Sep. OTOH, I don't know whether any broken date links use it. Also, technically only "[ _]" is needed rather than "[\s_]".
 * I think our intent is to recognize all the cases that MediaWiki recognizes, plus more. Thus, cases like " Sept 1, 2009 " and " 1 January2009 " (no space) would be delinked and have their punctuation corrected. -- Tcncv (talk) 01:50, 24 August 2009 (UTC)
 * 14) To match the comma in the same way MediaWiki does, use "(?: *, *| +)"; use " *, *" or " +" to match comma or non-comma, respectively. Also, you'll need "(?![a-z])" at the end of each regex to truly match MediaWiki's date autoformatting.
 * See above response. -- Tcncv (talk) 01:50, 24 August 2009 (UTC)
 * 15) You could probably do each of the replacements with one well-constructed preg_replace instead of with a match and loop.
 * See proposed change to replace logic here. -- Tcncv (talk) 01:50, 24 August 2009 (UTC)
 * 16) Your "brReg" regex is broken, it'll leave code like " [[1 January 2009 ".
 * Fixed by this edit. -- Tcncv (talk) 01:34, 24 August 2009 (UTC)
 * 17) Your "amOdd" regex is broken, "January 1 2009" is not a valid format for autoformatted dates.
 * Fixed by this edit. -- Tcncv (talk) 01:34, 24 August 2009 (UTC)
 * 18) Using strtotime and date will screw up unusual dates like "February 30 2009" which are correctly handled by the MediaWiki date autoformatting.
 * What would be a good alternative? @harej 19:11, 23 August 2009 (UTC)
 * See proposed change to replace logic here. -- Tcncv (talk) 01:34, 24 August 2009 (UTC)
 * ↑ That is the good alternative. Anomie⚔ 17:57, 24 August 2009 (UTC)
 * 19) Note that "(?!\s\[)(?:\s*(?:,\s*)?)" in the brOdd regex could be reduced to ",\s*", which doesn't match what MediaWiki's date formatting will match.
 * We are also looking to match cases with multiple spaces or spaces before the comma. -- Tcncv (talk) 01:50, 24 August 2009 (UTC)
 * 20) Your bot will edit articles solely to add your comment, which is annoying to anyone watching the page and generally a waste. When you change it to use a local database instead, the bot will waste resources making null edits.
 * I was not expecting this to be a problem, since only articles that appear on WhatLinksHere would work, but it would leave the comment if, say, August 23 was in the article but not accompanied by a year. That is the problem. @harej 19:11, 23 August 2009 (UTC)
 * 21) While it's unlikely to run into an edit conflict, I see you do no edit conflict checking. And given the nature of the bot, people will complain if it overwrites someone's legitimate edit even once.
 * 22) There is no error checking from the page edit either. It could be useful to log protected page errors and the like.
 * 23) It is not exclusion compliant.
 * 24) You really do need to add in sleep(10) or the like after each edit. And you should really use maxlag on all queries, too.
 * 25) The edit summary must be more informative than a cryptic "Codes: amReg". Explain what you're doing in plain English, and put the codes at the end.
 * Each edit summary links to a page about the codes. Obviously the page will be made before the bot begins to edit. @harej 19:11, 23 August 2009 (UTC)
 * Not good enough, IMO. Consider what a random person unaware of all this date mess will think when seeing edits from this bot in his watchlist. "Codes: blah blah" is much less informative than "Unlinking auto-formatted dates per WP:Whatever. Codes: blah blah". Anomie⚔ 17:57, 24 August 2009 (UTC)
 * Edit summary now begins with "Unlinking autoformatted dates per User:Full-date unlinking bot. Codes: ". @harej 18:29, 24 August 2009 (UTC)
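The regex fragments quoted in the review (the separator "(?: *, *| +)", the "[ _]" between month and day, and the trailing "(?![a-z])") can be combined into a single substitution per date format, as suggested. The following is a minimal sketch in Python rather than the bot's PHP. The pattern names, the month list, and the function name are illustrative assumptions; this is neither the bot's actual code nor MediaWiki's implementation.

```python
import re

# Assumption: only full English month names are handled in this sketch.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

SEP = r"(?: *, *| +)"   # a comma with optional spaces, or spaces only
TAIL = r"(?![a-z])"     # reject a lowercase letter right after the closing ]]

# [[January 1]], [[2009]]  (space or underscore between month and day)
MONTH_DAY_YEAR = re.compile(
    r"\[\[(" + MONTHS + r")[ _](\d{1,2})\]\]" + SEP + r"\[\[(\d{1,4})\]\]" + TAIL)

# [[1 January]] [[2009]]
DAY_MONTH_YEAR = re.compile(
    r"\[\[(\d{1,2})[ _](" + MONTHS + r")\]\]" + SEP + r"\[\[(\d{1,4})\]\]" + TAIL)


def unlink_full_dates(text):
    """Unlink full dates and normalise the punctuation between the parts."""
    text = MONTH_DAY_YEAR.sub(r"\1 \2, \3", text)
    text = DAY_MONTH_YEAR.sub(r"\1 \2 \3", text)
    return text


if __name__ == "__main__":
    print(unlink_full_dates("Born [[January 1]], [[2009]]; died [[2 March]] [[2010]]."))
```

Piped links such as "[[January 1|New Year]]" and month-day links without a linked year are left untouched, because neither pattern matches them. This is consistent with the decision above to ignore piped links and to unlink only full dates.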

All in all, this code needs a lot of work before it's ready to even be given a test run. Anomie⚔ 18:43, 23 August 2009 (UTC)
 * Some of these have already been fixed. Would it be okay if I struck out parts of your comment as problems are rectified? @harej 18:51, 23 August 2009 (UTC)
 * Go ahead. Anomie⚔ 18:56, 23 August 2009 (UTC)
 * Added some regex related responses above -- Tcncv (talk) 01:34, 24 August 2009 (UTC)


 * I agree with Anomie wrt the hidden comments in pages. Hidden comments may be removed by people who don't know what they're for, or if people revert the bot, they may revert the comment too, so they aren't reliable. Using a database should add only a trivial amount of overhead, much less than getting the page text. Also, if you run the bot from the toolserver, you could put the database on the sql-s1 server, replace the API backlinks call with a database query with a join on your own database, and not have to do a separate query to check if you've already processed it (you also wouldn't have to worry about blnamespace being disabled without warning if you use the database directly). Mr.Z-man 13:38, 24 August 2009 (UTC)
 * I have been looking into using an SQLite database backend to store the page IDs of pages that have already been treated. Consider it definitely a part of the next version. @harej 17:16, 24 August 2009 (UTC)
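The SQLite approach discussed above, storing the page IDs of already-treated pages, can be sketched as follows. This is an illustrative Python sketch using the standard sqlite3 module, not the bot's PHP code; the table name, column name, and function names are assumptions.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the store of processed page IDs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed (pageid INTEGER PRIMARY KEY)")
    return conn


def already_processed(conn, pageid):
    """Return True if this page ID has been recorded before."""
    row = conn.execute(
        "SELECT 1 FROM processed WHERE pageid = ?", (pageid,)).fetchone()
    return row is not None


def mark_processed(conn, pageid):
    """Record a page ID; recording the same ID twice is harmless."""
    conn.execute(
        "INSERT OR IGNORE INTO processed (pageid) VALUES (?)", (pageid,))
    conn.commit()


if __name__ == "__main__":
    store = open_store()
    mark_processed(store, 1004)
    print(already_processed(store, 1004), already_processed(store, 1005))
```

Keying on the page ID rather than the title follows the recommendation in the code review, since page IDs are not affected by page moves.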

Comments on the revised code in beta release 3: Anomie⚔ 17:57, 24 August 2009 (UTC)
 * 1) The negative-lookahead in part_AModd_punct and part_BRodd_punct should be unnecessary, as the prior processing of the non-odd versions should have already taken care of anything the negative-lookahead group would match. OTOH, it doesn't hurt anything being there either.
 * You are correct. It is my intent to code the regular expressions with minimal dependence on each other, so that they can be tested individually to demonstrate that they match the intended targets and do not match the unintended targets. As you point out, it doesn't hurt anything, so I think I'll leave them in place. -- Tcncv (talk) 03:34, 25 August 2009 (UTC)
 * 2) One potential issue: "[[January 1]], [[1000_BC]]" will be delinked to "January 1, 1000_BC" (with an underscore in the year). The easiest fix might be a simple " $contents=preg_replace('/\[\[(\d{1,4})_BC\]\]/', '$1 BC', $contents); " before the other processing.
 * I looked at this as a possible independent AWB task and discovered that there are no cases to fix! I'll keep an eye out in case something slips into the database. Your obviously thorough examination is welcome and appreciated. -- Tcncv (talk) 03:34, 25 August 2009 (UTC)
 * 3) A theoretical 45 edits per minute (15 edits every 20 seconds) is still quite fast. This task is really not so urgent that the 6epm suggestion in WP:BOTPOL needs to be ignored. To be simple, just sleep(10) after each edit; to be slightly more complicated (but closer to a real 6epm), store the value of time() after each edit and just before the next edit do "$t=($saved_time + 10 - time()); if($t>0) sleep($t);" (and be sure to implement edit conflict detection!).
 * The sleep conditional does not indicate 15 edits every 20 seconds, but that it would make 15 edits (each edit taking at least two seconds, more if there are a lot of dates), then do nothing for 20 seconds. I am not sure on the math, but that is a lot slower than 45 edits per minute. @harej 18:29, 24 August 2009 (UTC)
 * You must have an incredibly slow computer there, if those few regular expressions are going to take several seconds to run. Anomie⚔ 01:28, 25 August 2009 (UTC)
 * "Several" is not the right word. However long it will take, I've since changed the code so that it will sleep after each edit, for simplicity's sake (and because I probably underestimate the speed). @harej 02:09, 25 August 2009 (UTC)
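Anomie's second pacing suggestion, remembering when the last edit happened and sleeping only for the remainder of the ten-second interval, can be sketched as below. This is a Python illustration, not the bot's PHP; the class name and its injectable clock and sleep parameters are assumptions, added so the behaviour can be checked without really waiting.

```python
import time


class EditThrottle:
    """Keep at least min_interval seconds between consecutive edits."""

    def __init__(self, min_interval=10.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_edit = None  # time of the previous edit, if any

    def wait(self):
        """Call just before an edit: sleep off whatever remains of the interval."""
        if self._last_edit is None:
            return
        remaining = self._last_edit + self.min_interval - self._clock()
        if remaining > 0:
            self._sleep(remaining)

    def mark(self):
        """Call right after an edit to record when it happened."""
        self._last_edit = self._clock()
```

Compared with a fixed sleep(10) after each edit, time already spent fetching and processing the next page counts toward the interval, so the rate stays closer to a real six edits per minute.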

Beta Release 4
Here is the difference between Beta Release 3 and Beta Release 4. There are two significant changes. First, it now rests for ten seconds after each edit instead of 20 seconds after 15 edits, which is plain easier on the server. Second, it will now keep track of pages it has already edited through page IDs recorded in a text file. I know I originally said that it would be an SQL database; however, considering how non-complicated a task storing a list of page IDs is, I figured this would be simpler. Of course this can change again if I was a dumbass for using a comma-delimited text file. And there are still more problems to address. @harej 21:56, 24 August 2009 (UTC)
 * I'm glad you went with the sleep after each edit. Comments:
 * I'd recommend going ahead with sqlite or another database, just because that has already solved the issues with efficiently reading the list to find if a particular entry is present and then adding an entry to the list and writing it to disk.
 * It's a better solution, you're saying? @harej 02:43, 25 August 2009 (UTC)
 * It's IMO the right tool for the job. Anomie⚔ 12:13, 25 August 2009 (UTC)
 * Your "write back to the file" code needs to use "a" rather than "w" in the fopen; "w" will overwrite the file each time rather than appending the new entry as you intend.
 * It would be better if you can load the page contents and pageid in one API query rather than two.
 * Anomie⚔ 02:25, 25 August 2009 (UTC)
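The append-versus-overwrite point above applies to file modes in most languages. The sketch below shows it in Python rather than the bot's PHP. The function names are hypothetical, and it stores one page ID per line instead of the comma-delimited layout harej describes, which is an assumption made for simplicity.

```python
import os


def record_pageid(path, pageid):
    """Append one page ID to the file.

    Mode "a" adds to the end of the file; mode "w" would truncate the
    file on every call and keep only the most recent ID.
    """
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(f"{pageid}\n")


def load_pageids(path):
    """Read all recorded page IDs back as a set of integers."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as handle:
        return {int(line) for line in handle if line.strip()}
```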


 * n.b. Comma delimited at the very least could be problematic for article titles with commas in them. Recommend database, but otherwise a better delimiter would be an illegal character, perhaps a pipe | . Jarry1250 11:58, 25 August 2009 (UTC)
 * Since it's storing pageids rather than titles, commas are not an issue here. Anomie⚔ 12:13, 25 August 2009 (UTC)
 * Oops. I thought I was missing something. Suitably struck. - Jarry1250 [ In the UK? Sign the petition! ] 12:21, 25 August 2009 (UTC)
 * If you simply iterate through page IDs starting with 6 to save time... then you will only need to store one number. Rich Farmbrough, 08:46, 27 August 2009 (UTC).
 * Starting with 6? Why 6? @harej 20:24, 27 August 2009 (UTC)
 * Page IDs 0-5 are not in main-space, if they exist at all. Rich Farmbrough, 18:16, 28 August 2009 (UTC).

You might also be interested in. Rich Farmbrough, 09:17, 27 August 2009 (UTC).
 * I know it says "articles" above, but can you please confirm explicitly that this bot will only edit dates in namespace 0? Stifle (talk) 13:33, 31 August 2009 (UTC)
 * The bot will edit namespace 0 (the article space) and nowhere else. @harej 18:08, 31 August 2009 (UTC)
 * Proof of concept. - Jarry1250 [ In the UK? Sign the petition! ] 19:15, 12 September 2009 (UTC)

Trial run complete
The bot has successfully performed 51 edits as part of the trial. See Special:Contributions/Full-date unlinking bot. Note that the bot's edit to August was due to an oversight in one of the regular expressions used to exclude article titles. I reverted the bot's edits, corrected the regex, and it should not be a problem anymore. @harej 23:57, 2 October 2009 (UTC)
 * Thanks and congratulations, Harej. I seem to recall that a second, larger trial is part of the plan. Is this correct? Tony   (talk)  12:03, 3 October 2009 (UTC)
 * Possibly, but I think I will need approval for that. @harej 15:48, 3 October 2009 (UTC)
 * Hopefully soon. Can you run off a list (from your database) of pages you've already edited? Then we could check that these ones have been ticked off. - Jarry1250 [ In the UK? Sign the petition! ] 10:57, 4 October 2009 (UTC)
 * Oh, and can it keep track of articles with mixed date formats in them? Cheers, - Jarry1250 [ In the UK? Sign the petition! ] 10:57, 4 October 2009 (UTC)
 * Special:Contributions/Full-date unlinking bot shows which pages were actually edited, but according to the list of pages that went through the whole process because the title did not disqualify it immediately, the bot edited the pages with the following IDs: 1004, 1005, 6851, 6851, 12028, 13316, 19300, 19758, 20354, 21651, 25508, 26502, 26750, 27028, 27277, 30629, 30747, 31833, 31852, 32385, 33835, 37032, 37419, 42132, 53669, 57858, 65143, 65145, 67345, 67583, 68143, 69045, 74201, 74581, 84944, 84945, 84947, 95233, 96781, 106575, 106767, 107555, 116386, 131127, 133172, 147605, 148375, 159852, 161971, 180802, 188171, 207333, 230993, 232200, 245989, 262804, 272866, 311406, 314227, 319727, 321364, 321374, 321380, 321387, 333126, 355604, 377314, 384009, 386397, 402587, 403102, 410430, 414908, 415034, 418947, 434000, 438349, 480615, 497034, 501745, 503981, 508364, 532636, 535852, 545822, 562592, 577390, 595530, 613263, 617640, 625573, 634093, 641044, 645624, 656167, 658273, 682782, 696237, 728775, 743540, 748238, 761530, 769466, 784200, 819324, 826344, 833837, 839074, 840678, 842728, 842970, 842995, 858154, 865117. @harej 16:00, 4 October 2009 (UTC)
 * I checked a few, and it looks good. It got the commas right as well. --Apoc2400 (talk) 13:26, 9 October 2009 (UTC)

Request for a second trial
Considering that the first trial was successful, I am now requesting a second trial of 500 edits. @harej 16:22, 4 October 2009 (UTC)
 * , otherwise impossible to look from. - Jarry1250 [ In the UK? Sign the petition! ] 18:23, 9 October 2009 (UTC)
 * The bot has performed 202 edits in its second test run. In between runs, I changed the code to finally ditch the flat file database of page IDs in favor of a proper MySQL database of page titles, which has improved performance. Additionally, the bot will no longer try to submit a page unless a change has been made to the page. There are only three awry edits to report: a page blanking, another page blanking, and some weird jazz involving the unlinker that Tcncv should probably look into. The page blankings should not be a problem; I have reverted them and I have updated the code to make sure the page is not blank. @harej 01:15, 10 October 2009 (UTC)
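The two safeguards harej describes after the second trial, not submitting a page unless it changed and never saving a blank page, together with the earlier review point about unchecked fetch errors, amount to a small guard before each save. A minimal Python sketch follows (the bot itself is PHP). The function name is hypothetical, as is the convention that a failed fetch is represented by None.

```python
def should_save(original, updated):
    """Decide whether an edited page should be submitted.

    original: page text as fetched, or None if the fetch failed (assumed convention)
    updated:  page text after the unlinking pass
    """
    if original is None:
        return False  # fetch failed: never save on top of missing text
    if not updated or not updated.strip():
        return False  # never submit a blank page
    return updated != original  # skip edits that change nothing


if __name__ == "__main__":
    print(should_save("[[1 January]] [[2009]]", "1 January 2009"))
```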

Third trial
The issues of trial #2 have been rectified, and I forgot to place in a request for a third trial. I am requesting one now, and am also asking that contingent on the success of the third trial (and the outcome of the ArbCom motion), we begin considering letting the bot run full-time. @harej 01:02, 21 October 2009 (UTC)
 * - Jarry1250 [ In the UK? Sign the petition! ] 17:09, 21 October 2009 (UTC)
 * Completed. @harej 23:36, 5 November 2009 (UTC)
 * Also, the aforementioned ArbCom motion has passed. Dabomb87 (talk) 00:41, 6 November 2009 (UTC)


 * I have so far not found any issues with the last run of 500 edits, but will continue to go through the edits in the next few days. Naturally, I will flag anything unusual I find. What is the next step? Is there going to be a further trial, or will we submit an application to let the bot loose? Ohconfucius  ¡digame! 04:59, 7 November 2009 (UTC)
 * If nothing weird comes up, we should consider letting this bot run full time with a bot flag. 759 test edits is enough. @harej 05:49, 7 November 2009 (UTC)


 * This one's an oddball, which I've seen only once -linking each fragment, but not doing it properly either. Not a bot problem as such, but it just shows the stuff people do. Ohconfucius  ¡digame! 10:33, 7 November 2009 (UTC)
 * Don't worry, I just dropped Rich Farmbrough a message about it. I've done a sample check of the last batch, and found no issues. Ohconfucius  ¡digame! 13:27, 7 November 2009 (UTC)

Full live run

 * However, particularly for the non-techies monitoring the progress of this thread because of its significance, it should be noted this "approval" is not a carte blanche. Bot operators are expected to monitor the activities of their bots and respond promptly and efficiently to criticism and mistakes, and, in the latter case, fix the problem raised. This approval says that you can, not that you must. It is likely that ArbCom will be taking an interest, and hundreds of thousands of edits are unlikely to go unnoticed by the community at large. Please be careful. Thanks, - Jarry1250 [Humorous? Discuss.] 17:29, 7 November 2009 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.