Wikipedia:Bots/Requests for approval/DyceBot 4


 * The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Approved.

DyceBot
Operator: Dycedarg   ж 

Automatic or Manually Assisted: Automatic and unsupervised

Programming Language(s): Python using the Python Wikipediabot Framework

Function Summary: Rename pages with hyphens in numeric ranges to use en dashes instead per the MOS

Edit period(s) (e.g. Continuous, daily, one time run): Multiple runs, probably at least a few thousand at a time

Already has a bot flag (Y/N): Y

Function Details: Someone made a bot request requesting that a bot be designed to replace hyphens with en dashes in numeric ranges, as that is required by the MOS. I think that a general fix for occurrences of this in article text could be integrated with AWB's general fixes and propagated by SmackBot and regular AWB users, but numeric ranges in page titles could not be fixed in that manner. I did a preliminary search in the most recent database dump for date ranges (first thing that came to mind), and found almost 7000 of those needing to be fixed alone. I'd do an initial run for date ranges, and follow up with runs covering any other types of numeric ranges in titles I may discover requiring this in the future, and also run the same searches for bad titles when the next database dump comes out. Details of implementation: After I compile and check over a list of pages requiring fixing, the bot will test the page to make sure it's not a redirect (in which case it was moved by someone else since the database dump was made), use some simple regex to replace the hyphen with an en dash, make sure that the destination title is free, and then move the page and fix all resultant double redirects. All the moves that failed for some reason (move protection, destination page otherwise occupied, etc.) would be dumped into a list for me to check over manually later.
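
The checks described in the function details can be sketched roughly as follows. This is only an illustration, not the bot's actual code: the year-range pattern, the `plan_moves` helper, and the `is_redirect` / `exists` callables are all hypothetical stand-ins for whatever the bot framework provides. The sketch only plans moves and collects failures; the page move itself and the double-redirect fixing are left out.

```python
import re

# Assumed pattern for illustration: four-digit year, hyphen, two-to-four-digit year.
YEAR_RANGE = re.compile(r"\b(\d{4})-(\d{2,4})\b")
EN_DASH = "\u2013"


def plan_moves(titles, is_redirect, exists):
    """Split candidate titles into planned moves and failures for manual review.

    is_redirect and exists are hypothetical callables (title -> bool).
    """
    moves, failures = [], []
    for title in titles:
        # A redirect means someone else moved the page after the dump was made.
        if is_redirect(title):
            failures.append((title, "already a redirect"))
            continue
        # Swap the hyphen inside the year range for an en dash.
        new_title = YEAR_RANGE.sub(
            lambda m: f"{m.group(1)}{EN_DASH}{m.group(2)}", title
        )
        if new_title == title:
            failures.append((title, "no range found"))
            continue
        # Only move if the destination title is free.
        if exists(new_title):
            failures.append((title, "destination occupied"))
            continue
        moves.append((title, new_title))
    return moves, failures


# Toy stand-in for the wiki: title -> whether the page is a redirect (made-up data).
pages = {
    "1999-2000 season": False,
    "2001-02 season": True,
    "2003-04 season": False,
    "2003\u201304 season": False,
}
moves, failures = plan_moves(
    ["1999-2000 season", "2001-02 season", "2003-04 season"],
    is_redirect=lambda t: pages.get(t, False),
    exists=lambda t: t in pages,
)
```

With the made-up data above, only the first title ends up in `moves`; the other two land in `failures` with a reason attached, which corresponds to the list that would be checked over manually.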

Discussion
Sounds useful. Are there any legitimate uses of a hyphen between two series of numbers? — Werdna talk 12:10, 1 May 2008 (UTC)


 * Yes. I believe ISO numbers use hyphens, as do ISBNs. Hence when I said I was going to check through the edits before making them. The first run would just be date ranges, after that I'm going to hunt through articles using hyphens for broad categories of those that should be using en dashes instead. Additionally, it was pointed out to me that there are other things besides numbers that misuse hyphens, such as the country-country relations articles, and country-country border articles. I'm now planning on fixing those too. My basic method will be to come up with regex that will match articles that need to be fixed, and then check through them for false positives before running the bot.-- Dycedarg  ж  19:55, 1 May 2008 (UTC)

What about going for a regex like /\b\d{2,}\-\d{2,}\b/ to match anything that looks like 2005-2006, without matching ISBNs (i.e. ISBN 1-84356-028-3), or anything which includes only one number on one side or the other (such as ISO 691-3). — Werdna talk 12:10, 2 May 2008 (UTC)


 * The regex I used to compile the list of dates I have was rather similar, as I recall it was something like /\b\d{4}\-\d{2,4}\b/ (as I have never seen someone use two double digit dates for a date range in an article title). Yours catches stuff that couldn't be a date like IEEE 754-1985 and Meanings of asteroid names (90001-91000) (although the asteroid names articles may require changing, I'll have to look into that), and even mine catches stuff like ISO/IEC 8859-15 and ISO 3166-2:2002-05-21. So it needs weeding either way.-- Dycedarg  ж  18:49, 2 May 2008 (UTC)
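
For comparison, the two patterns quoted above can be tried against the titles mentioned in this thread. This is a small sketch using Python's `re` module; the names `BROAD`, `NARROW`, and `first_match` are made up for the example, and the narrower pattern is only "something like" the one actually used, so treat it as approximate.

```python
import re

BROAD = re.compile(r"\b\d{2,}\-\d{2,}\b")    # pattern suggested by Werdna
NARROW = re.compile(r"\b\d{4}\-\d{2,4}\b")   # approximate pattern recalled by Dycedarg

TITLES = [
    "2005-2006",
    "ISBN 1-84356-028-3",
    "ISO 691-3",
    "IEEE 754-1985",
    "Meanings of asteroid names (90001-91000)",
    "ISO/IEC 8859-15",
    "ISO 3166-2:2002-05-21",
]


def first_match(pattern, title):
    """Return the first substring the pattern matches, or None."""
    m = pattern.search(title)
    return m.group(0) if m else None


for title in TITLES:
    print(f"{title!r}: broad={first_match(BROAD, title)!r}, "
          f"narrow={first_match(NARROW, title)!r}")
```

One thing the comparison brings out: as written, the broader pattern still finds `84356-028` inside the ISBN example, because the surrounding hyphens count as word boundaries. The narrower pattern skips the ISBN, IEEE 754-1985, and the asteroid range, but still hits `8859-15` and `2002-05`, so both lists need weeding by hand.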

Remove any of them that are preceded by a word in all-caps? — Werdna talk 00:42, 3 May 2008 (UTC)


 * That's basically what I'm doing to remove them. Unfortunately, I can't do that as an absolute rule due to the potential for the name of a sports league (like NHL or NFL) to appear next to a year.-- Dycedarg  ж  23:47, 3 May 2008 (UTC)
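
The all-caps filter with exceptions might look something like the sketch below. The exception set, the helper name `is_probable_date_range`, and the "NHL ..." titles are assumptions made up for illustration; the date pattern is the approximate one quoted earlier.

```python
import re

DATE_RANGE = re.compile(r"\b\d{4}-\d{2,4}\b")
# An all-caps word (two or more letters) directly before the range.
PRECEDING_CAPS = re.compile(r"\b([A-Z]{2,})\s+$")
# Hypothetical exception list: league names that can legitimately precede a year.
LEAGUE_EXCEPTIONS = {"NHL", "NFL"}


def is_probable_date_range(title):
    """Heuristic: keep a title unless an all-caps word precedes the range."""
    m = DATE_RANGE.search(title)
    if m is None:
        return False
    caps = PRECEDING_CAPS.search(title[:m.start()])
    if caps and caps.group(1) not in LEAGUE_EXCEPTIONS:
        return False  # looks like a standard number, e.g. "ISO/IEC 8859-15"
    return True


for title in ["ISO/IEC 8859-15", "NHL 2005-06 season", "ISO 3166-2:2002-05-21"]:
    print(title, is_probable_date_range(title))
```

Even with the exception list, this heuristic lets `ISO 3166-2:2002-05-21` through, since the all-caps word is not directly in front of the matched range. That fits the point that it cannot be applied as an absolute rule and the resulting list still has to be checked by hand.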

Ah well. You can always add exceptions. Any objections to a trial run? — Werdna talk 04:13, 4 May 2008 (UTC)


 * There's a conversation that needs to happen first, because changing hyphens to en-dashes when an editor doesn't expect it will mean that the next time they search for something they wrote, they won't find it. I'll initiate a quick discussion at WT:MoS, and then if people are okay with it, post notices (say, in WP:MoS and on the WikiProject noticeboard and the Community Portal).  I note that this  bot is intended only for page titles, and the potential objection can be fixed with redirect pages, but as noted above, it's a short hop from doing this to use of AWB to accomplish the same thing in text, which is why discussion and notification is needed, I think. - Dan Dank55 (talk)(mistakes) 14:15, 4 May 2008 (UTC)


 * Ah, who would have to search for something they wrote? Wouldn't it be in their contributions list or their watchlist? And don't article titles with en dashes explicitly have to have redirects from the erroneous hyphenated version? I don't see the problem.
 * This bot is long overdue, and I commend the developer. Date ranges are a good start; would it be possible to do page ranges—in reference lists—at the same time? If not, you might consider them to be the second target-set (be careful not to change the very occasional hyphen in code etc.). Other uses of en dashes are more subtle; unfortunately, most misuses of hyphens for en dashes in article titles are of this more subtle type. See MOS on dashes for a very good explanation.
 * Dan seems to be an inveterate en-dash antagonist. Perhaps he's using a strange font that disguises the difference between them and hyphens. TONY   (talk)  14:38, 4 May 2008 (UTC)


 * "Ah, who would have to search for something they wrote?" Everyone who writes for a living.  As the saying goes among professionals, there is no writing, only re-writing.  And I am not an inveterate anything, nor antagonistic, nor against en-dashes; I'm sorry if I didn't make that clear before.  I completely support the current WP:MoS recommendation on use of en-dashes, and I have made the corrections at FAC's and GAN's.  I support the idea of revisiting the discussions concerning all characters not found on keyboards roughly once a year, for the simple reason that all such characters are slowly dying out in "persuasive" (not sure what I mean by that) English writing, because so much content is migrating to the web these days as the primary place where it lives.  We don't have to change our style the moment other publications do; we can and should be conservative.  But we should keep an eye on developments.
 * And I agree with Tony that, if we're going to make these conversions, they should be done with a bot. But there needs to be discussion, and it needs to be done carefully, and people have to be notified.  Notification is especially important when the proposed substitution is one that a majority of editors won't even notice or remember, unless they've been made aware of the issues. - Dan Dank55 (talk)(mistakes) 15:40, 4 May 2008 (UTC)


 * Most search engines seem not to distinguish the difference so I don't think this is a likely problem. It will be more relevant when a bot comes along to change hyphens in the body of articles rather than just the title, as browser find functions do tend to notice the difference. Christopher Parham (talk) 16:02, 4 May 2008 (UTC)


 * The usage of dashes has been discussed again. And I mean recently; 23–27 April is not exactly old. The basic counter-arguments to the position that en dashes are dying out are there, although I could probably find a few more if asked to. Waltham, The Duke of 18:02, 4 May 2008 (UTC)

You're stating that the primary reason you want this discussed is what people might do with AWB? The potential use or misuse of AWB in areas relating to this task is absolutely none of my concern. I don't have any intention of doing it now, and if for some unforeseen reason I do later I'll worry about that then. What other people do is their problem. Take it up with them if/when it happens. It could not get added to the list of general fixes without a certain amount of consensus seeking regardless of what I do here anyway. As for the task itself: Watchlists follow moves, you'd end up with both the redirect and the page under the new name on your watchlist. There would be no difficulty in finding pages, the redirects may not be deleted as dictated by the MOS. Most people who have been around for any substantive amount of time are used to pages they write being moved at this point anyway, if they aren't then an introduction to the concept is in order. The moves have consensus because the MOS has consensus, as verified by the link provided by The Duke of Waltham. If you feel announcements are necessary, I will make them. But unless someone comes up with tangible objections I would appreciate some trial approval soon. Oh, and in response to Tony concerning page ranges in reference lists: As I stated above, this is only for page titles. If you wish to get page ranges in reference lists fixed you'll need to wait until someone comes up with a separate bot for main-text hyphen misusage or it's added to AWB's general fixes.-- Dycedarg  ж  22:03, 4 May 2008 (UTC)


 * As far as I know, changing some hyphens to en-dashes in page titles by bot is a good idea, since page titles come under Naming conventions and other policy pages, as long as we keep the redirect pages, but I'd like to give people a couple of days to say if they know if this will cause unforeseen search problems. As to the idea that because something is in a style guideline, that means that anyone can write a bot to enforce it in the text (not the titles) in all 2.3 million articles ... well, please don't tell people that MoS-editors made you do it.  Nowhere in WP:MoS do we pretend to be policy-makers. - Dan Dank55 (talk)(mistakes) 22:24, 4 May 2008 (UTC)


 * What do you think the people who write the general fixes for AWB do all the time? If we're not willing to follow our own style guidelines then we shouldn't bother having them. But I digress, as per what you said this comes under naming conventions anyway and that is policy, and it says follow the MOS on dashes (I had rather thought it fell under some policy). I do not see how any unforeseen search problems could arise; the redirects take care of what gets typed directly into the MediaWiki search box, and I have seen no evidence that the Google engine places any more importance on the title of the page than on any other part of it. If anyone would be aware of such limitations, a developer would, and Werdna has been commenting on this for days without any indication that this might be a problem.-- Dycedarg  ж  23:06, 4 May 2008 (UTC)


 * Wikipedia talk:MoS is the place to discuss the role of MoS. We haven't had any really contentious arguments for a while, and I don't think this would turn into a contentious fight, either.  If this is something you feel strongly about, jump in, at the section I mentioned above or in a new one. - Dan Dank55 (talk)(mistakes) 23:14, 4 May 2008 (UTC)


 * But what is there to discuss? The role of the Manual of Style in this case is known and accepted. Your search concerns have been addressed. Why should we delay the approval of this most obviously useful and beneficial bot with unnecessary sidesteps? If you really think the bot should be known to a wider audience, a note in WT:MoS referring editors to the discussion here would suffice. Waltham, The Duke of 23:29, 4 May 2008 (UTC)
 * The role of the Manual of Style, as often, is to enshrine the prejudices of a handful of editors. But I can conditionally approve this, provided it is never used on text. Efforts to use it on country-country articles will probably result in a distressingly large error rate; one of the meanings of Indo-Chinese must take a hyphen, and the other really should. 23:36, 4 May 2008 (UTC)
 * I cannot foresee ever expanding this particular task to include page text, and should I do so I give you my personal guarantee that I will, as policy dictates, go through another BRFA. Should anyone else decide to do so, you will have plenty of opportunity to object assuming they go through proper channels ahead of time. As for the country-country thing: The regex will be very specific; it will only cover word-word relations and word-word borders. I will peruse the resultant lists carefully for false positives and if I'm not sure about a particular title I'll skip it; if I'm unable to come up with a sufficiently stringent regex I'll skip that part altogether. Also, the phrase Indo-Chinese is contained within a grand total of one page name anywhere in Wikipedia.-- Dycedarg  ж  00:23, 5 May 2008 (UTC)


 * Okay, on that reassuring note, I don't have a problem with deployment right away. If I hear something we haven't thought of yet, I'll let you know. - Dan Dank55 (talk)(mistakes) 00:38, 5 May 2008 (UTC)

Manderson's same old mantra? What about his prejudices? TONY  (talk)  02:34, 5 May 2008 (UTC)
 * Um, is this the start of a productive discussion relating to the bot? If not I'd rather you took it elsewhere. Pmanderson has not explicitly opposed this particular bot (as I never intended to edit article text), and any general disagreement you two have on the role of the MOS, its applicability for future bots that are not this bot, anyone's biases or lack thereof, or anything similar would best be pursued on Wikipedia talk:MOS or your own talkpages.-- Dycedarg  ж  02:56, 5 May 2008 (UTC)


 * Um ... thanks, but mind your own business Removed my rude comment. Manderson raised his own little pet peeve that does impinge on the current issue. TONY   (talk)  03:01, 5 May 2008 (UTC)


 * It always does... (exasperated sigh) Waltham, The Duke of 17:29, 5 May 2008 (UTC)

Fine with me. Anybody object to a trial? — Werdna talk 10:02, 6 May 2008 (UTC)


 * No objections from me, although... I just remembered this: will the bot address the requirement, in some titles, for spaced en dashes? These will be relatively few, as most date ranges in titles comprise years alone, but it is an important parameter to consider nonetheless. Even though, now that I think about it, the bot will probably only recognise hyphens between numbers, disregarding month-year compounds, which will have letters at least on one side of their respective hyphen. Waltham, The Duke of 01:31, 7 May 2008 (UTC)
 * For the first thing I'm doing, it will be just years and this won't be an issue. Month-year compound ranges will be a later task (assuming there are any in titles at all, which I don't know yet), and as they all need to be spaced it won't be an issue there. I did a preliminary regex search for the country-country border/relations articles, and there are few enough of those for me to be able to separate them into spaced and unspaced lists manually. In summation, the bot won't be doing this itself, but I will be running it on lists either requiring or not requiring spaces individually, and running it in the correct mode for each.-- Dycedarg  ж  02:44, 7 May 2008 (UTC)
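
Running the bot in a spaced or unspaced "mode" per list could be sketched as below. The helper name `replace_hyphen`, the pattern, and the sample titles are assumptions for illustration only, not the bot's real code.

```python
import re

EN_DASH = "\u2013"
# A hyphen between two word characters, with optional surrounding spaces.
HYPHEN_BETWEEN = re.compile(r"(?<=\w)\s*-\s*(?=\w)")


def replace_hyphen(title, spaced):
    """Replace the first such hyphen with a spaced or unspaced en dash."""
    separator = f" {EN_DASH} " if spaced else EN_DASH
    return HYPHEN_BETWEEN.sub(separator, title, count=1)


# Unspaced mode for plain year ranges (made-up title).
print(replace_hyphen("2005-06 season", spaced=False))
# Spaced mode for month-year compounds (made-up title).
print(replace_hyphen("May 2007-June 2008", spaced=True))
```

Keeping the mode as an explicit argument mirrors the approach described above: the lists are separated by hand into spaced and unspaced, and the same replacement is run once per list with the matching setting.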


 * Thanks. A particular problem is such titles as "Mexican-American war"; looking forward to that fix. TONY   (talk)  03:41, 7 May 2008 (UTC)

OK, all questions answered, there are no objections. Can I have trial approval now?-- Dycedarg  ж  07:51, 9 May 2008 (UTC)

—  Ree dy  20:36, 11 May 2008 (UTC)
 * Let's see a 20 article trial. βcommand 2 20:01, 11 May 2008 (UTC)
 * ✅. I went by Betacommand's request for a 20 article trial, which ends up as more than 20 edits (article move+talk page move+fixing double redirects for each move).-- Dycedarg  ж  00:06, 12 May 2008 (UTC)
 * Quick note: The bot will be editing faster when I start doing this in large numbers, for the trial I didn't bother adjusting Pywikipediabot's default throttle. I'm thinking somewhere around 10 epm.-- Dycedarg  ж  00:11, 12 May 2008 (UTC)
 * Could I have some feedback? The edits are good (they are what I intended at any rate), and I'd appreciate approval at some point. Or a response directing what must be done to gain it. I have received no complaints relating to the edits the bot made, either from those who expressed reservations above or anyone else.-- Dycedarg  ж  05:53, 14 May 2008 (UTC)

Hmmm. This idea seems a bit : - / Using en-dashes has generally been discouraged from page titles because they're difficult to type. Regardless of what the MoS says, moving thousands of pages to a (relatively) more inaccessible title seems like a bad idea. It creates needless redirects and is just generally confusing to the average reader. The example listed above (Mexican-American War) has been moved a couple of times between en-dash and not. Consensus seems to favor not using special dashes in page titles. --MZMcBride (talk) 06:01, 14 May 2008 (UTC)
 * Using en-dashes does not make pages more inaccessible. Not in the slightest. The average reader is redirected upon typing the title into the search bar, and would either not notice or not care about the message at the top of the screen informing them of this. The average editor has no reason to care about or type the title of the page they're editing, with the sole exception of finding it via ctrl-f on their watchlist, which is a minor issue at best. If they wish to link directly to the title of such an article instead of its redirect, it's as easy as copying and pasting, which would be a good practice to encourage anyway as en-dashes should ideally be used in their proper context as often as possible. Redirects are cheap, and common enough as to not be confusing. Finally, further research into the issue of the article Mexican-American War reveals that the revision of the naming conventions linked to as the rationale for the latest page move stated that using special dash characters in article titles was forbidden due to technical restrictions, technical restrictions which I have no reason to believe still exist. Should someone demonstrate that this is still a serious issue, then the point would obviously be moot. Otherwise the title for that article is no evidence for anything. Absent that restriction, the naming conventions, which are policy, state that the MOS section on dashes is to be followed. Unless you can demonstrate to me that consensus no longer exists or never existed for that part of the policy, I have little reason to refrain from doing this on consensus-based grounds. As a side note, if you want to know why I want to do this and am going to such effort to do so, look at Category:Ligue 1 seasons, which contains the articles I moved for my trial. The dashes look better; with the hyphens the numbers are too close together. It is much more visually appealing and easier to read, which is in my opinion a far greater boon to the readers of this encyclopedia than the redirects will be a burden to anyone.-- Dycedarg  ж  08:20, 14 May 2008 (UTC)


 * Totally agree with Dyce. MZMcBride, no, "Mexican–American War" has to have an en dash; whoever moved it back again, I just can't believe the simple-mindedness of it. This issue was resolved last year during an extensive debate. TONY   (talk)  09:29, 14 May 2008 (UTC)

Wouldn't this bot make creating redirects more difficult? Since any redirect would have to be to the en dash version of the article, but some may not notice the difference, and default to the glyph corresponding to a keyboard key. (This would then create a double-redirect.) — PyTom (talk) 06:43, 17 May 2008 (UTC)
 * Anyone creating a redirect would indeed need to copy and paste the name of the page directly, which I for one do anyway to prevent misspellings and because it's easier. Due to the rarity with which redirects need to be made, I do not see this as being a particularly problematic issue. Any double redirects accidentally made would show up on Special:DoubleRedirects and get fixed by bots in any case.-- Dycedarg  ж  06:53, 17 May 2008 (UTC)


 * Where's the hold-up? Waltham, The Duke of 12:49, 22 May 2008 (UTC)
 * Indeed; as one well-known WPian says: "Just do it". TONY   (talk)  16:12, 22 May 2008 (UTC)


 * The trial moves were on pages with virtually identical titles, I'd really like to see a variety of different titles to make sure there aren't going to be false positives. (100 articles). Mr.Z-man 00:05, 23 May 2008 (UTC)
 * How is this coming along? SQL Query me!  07:11, 4 June 2008 (UTC)
 * Sorry about the delay, as noted on my userpage I was out of the country for the past couple weeks. As I am now back, I will be running the 100 article trial tonight during off-peak hours.-- Dycedarg  ж  22:18, 4 June 2008 (UTC)
 * Didn't see, sorry, thanks for getting back to me so quickly! :) SQL Query me!  03:59, 6 June 2008 (UTC)

Trial done. Sorry for the additional delay, the page moving function of Pywikipedia is broken and I had to fix up a workaround. I think it ended up being less than 100 articles, as the list I pulled them from was generated using the dump before the most recent one as I didn't have the latest one downloaded yet, so some of the pages might have been moved since the dump was generated. Anyway I think it should be a wide enough sampling for you to judge for yourselves the effectiveness of my safeguards against false positives.-- Dycedarg  ж  02:48, 11 June 2008 (UTC)

So, feedback? Can I have approval now?-- Dycedarg  ж  11:54, 13 June 2008 (UTC)
 * Yep, all seems good. You're up! giggy (O) 12:04, 13 June 2008 (UTC)


 * The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.