Wikipedia:Bots/Requests for approval/Disambot


 * The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Request Expired.

Disambot
Operator: — xDanielx  T/C\R

Automatic or Manually Assisted: Both. Simple remedies are applied automatically; pages with more complex issues are listed on User:FuBot/Potential atrocities for human review.

Programming Language(s): Python (custom library for interfacing with MediaWiki)

Function Summary: The robot will clean up disambiguation pages in accordance with Manual of Style (disambiguation_pages).

Edit period(s) (e.g. Continuous, daily, one time run): I plan to start with short periods of editing. If all goes well, eventually I'd be running it for several hours at a time.

Already has a bot flag (Y/N):

Function Details:
 * Ordered list items are replaced with unordered list items.
 * Pages with external links are listed for human review.
 * Punctuation is stripped from the end of disambiguation items, unless there are multiple sentences involved (in which case it lists the page for human review).
 * If the beginning of an item contains a bold wikilink, the boldness is stripped (italics are left in place, per the MoS).
 * Where piped links are found at the beginning of a disambiguation item, the pipes are removed and just the title of the article is used as a link (i.e., the string to the left of the pipe).
 * Links within the description of an item are stripped unless the item has no link (or a red link) at the beginning, in which case the page is listed for human review. If the link is piped, the alternate text (i.e., the string to the right of the pipe) is used to preserve the coherence of the description fragment.
 * There's a bit more, but I think you get the idea. :-)
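As a rough illustration of the simpler rules in the list above, a per-line cleanup could look something like the following. This is only a minimal sketch, not the bot's actual source: the function name `clean_item` and the regular expressions are assumptions made for this example, the "multiple sentences" test is a crude heuristic, and the stripping of links inside descriptions and the handling of abbreviations are left out.

```python
import re

# All names below are hypothetical; they are not taken from the bot's code.
BOLD_LEADING_LINK = re.compile(r"^'''(\[\[[^\]]+\]\])'''")
PIPED_LEADING_LINK = re.compile(r"^\[\[([^\]|]+)\|[^\]]*\]\]")
EXTERNAL_LINK = re.compile(r"https?://")
SENTENCE_BREAK = re.compile(r"[.!?]\s+[A-Z]")  # crude "more than one sentence" test


def clean_item(line):
    """Return (new_line, needs_review) for one line of a disambiguation page."""
    match = re.match(r"^([*#]+)\s*(.*)$", line)
    if not match:
        return line, False  # not a list item (e.g. the introductory line)
    markers, body = match.groups()
    markers = markers.replace("#", "*")  # ordered list -> unordered list
    body = body.rstrip()
    # External links and multi-sentence items are only flagged, never fixed.
    if EXTERNAL_LINK.search(body) or SENTENCE_BREAK.search(body):
        return f"{markers} {body}", True
    body = BOLD_LEADING_LINK.sub(r"\1", body)       # un-bold a leading link
    body = PIPED_LEADING_LINK.sub(r"[[\1]]", body)  # de-pipe a leading link
    body = body.rstrip(".,;:")                      # strip trailing punctuation
    return f"{markers} {body}", False


print(clean_item("# '''[[Foo (band)|Foo]]''', a rock band."))
print(clean_item("* [[Foo]], a plant. It grows in [[Bar]]."))
```

Because both patterns are anchored to a leading `'''` or `[[`, an item that starts with an italicized link (`''[[...]]''`) is left alone, which matches the intent of keeping italics in place.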

Discussion

 * Note -- I gave the bot some very short runs (a few edits at a time) for testing purposes. I figured this would be okay since I was reviewing each edit and reverting small mistakes within a minute or so. — xDanielx  T/C\R 06:59, 19 June 2008 (UTC)
 * As far as tests go, I am concerned by the bot's two earliest edits. Why did you have the bot do that to mainspace articles? Testing is all well and good, but do so in userspace, please, especially if you plan on completely replacing live content with "test".  SQL Query me!  07:28, 19 June 2008 (UTC)
 * I'm still not sure why, but the sandbox kept throwing me a MediaWiki:Deletedwhileediting error. I've been watching MarketTools since creating it and it hadn't had a human edit in 7-8 months, so I figured I couldn't do any harm there if I reverted the test edit immediately. In hindsight, I should have used my userspace for the test. I apologized to the two editors who reverted before I could (within the minute). — xDanielx  T/C\R 19:11, 19 June 2008 (UTC)
 * I don't think removing piping of links is a good idea. It will mean that the sentence the link starts will no longer make sense. Why are you removing links within the description? -- maelgwn - talk 23:41, 19 June 2008 (UTC)
 * Regarding the links at the beginning, the style guideline is that items should have the form " link, description fragment " -- so theoretically using the article's full title should very rarely be a problem. I know this isn't always followed -- full sentences are often used -- but they usually come in forms like
 *  A subject is . . . 
 *  The subject was . . . 
 *  In psychology, subject is . . . 
 * The bot won't strip or de-pipe any links in these cases -- it only does so when there is a blue link at the beginning. I reckon there will still be occasional cases where the bot's de-piping would be harmful, like changing " John Doe is an actor " to " John Doe (actor) is an actor ", but I think these would be infrequent. Perhaps I could have the bot check for redundant rhetoric (like "actor" in the example) to avoid some of these cases.
 * Regarding removal of links in the description, it's just what the MoS prescribes. As I understand it, the reasoning is that multiple links may be distracting to the reader, while having just one link at the beginning of each item makes for easier skimming. But again, to be safe, the bot removes these links only when it finds a working (that is, blue) link at the beginning (otherwise the page is listed for human review). — xDanielx  T/C\R 00:41, 20 June 2008 (UTC)

I think most of the logic here is pretty well thought out, but I am concerned about the "infrequent" cases where the depiping might cause problems. Since we're talking about editing a rather huge number of articles (~100,000), infrequent cases are going to add up to a significant number. Perhaps that particular function should be bot-assisted (with human oversight) rather than automatically executed. I would also like to see several test runs of increasing length before we turn it loose on all 100,000 disambiguation articles. Obviously such a huge task needs to be undertaken with great care so as not to create more problems than it is fixing. Kaldari (talk) 18:43, 21 June 2008 (UTC)
 * Hm, perhaps as a compromise I could have the bot go ahead and remove the pipes, but record the changes in a log every time it does so? Then I could post the logs after each run, review them (perhaps with help from others, depending on how big the logs end up being) and fix any bad changes we come across. I think that would be easier for us than making the edits ourselves, as we wouldn't need to click around, wait for pages to load, or search for a particular line within each page. If it ends up requiring multiple editors, we could just cut large chunks out of a log and review them in a text editor so that two people don't check the same logs redundantly. Does that sound workable? — xDanielx  T/C\R 19:47, 21 June 2008 (UTC)
 * Only if there are actually editors committed to doing the review. Kaldari (talk) 15:19, 23 June 2008 (UTC)
 * I'll make sure all the review is done before starting a new run, mm'kay? I should be able to finish by the end of summer if I can average around 1,700 articles per day. So if one DAB page needs ~2 de-pipes on average (very rough estimate), and I can review about one per second, 50-60 minutes per day should suffice. I think I can cover that, but if it turns out to be too much, I'll just get help or extend the schedule instead of letting the backlog pile up. — xDanielx  T/C\R 22:20, 23 June 2008 (UTC)
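The review-time estimate above can be checked with a few lines of arithmetic. The figures are the ones quoted in this thread (roughly 100,000 pages, 1,700 pages per day, about 2 de-pipes per page, one review per second); the variable names are just for this sketch.

```python
import math

TOTAL_PAGES = 100_000    # approximate number of DAB pages cited in this discussion
PAGES_PER_DAY = 1_700    # planned daily average
DEPIPES_PER_PAGE = 2     # the "very rough estimate" from the comment above
SECONDS_PER_REVIEW = 1   # one logged change reviewed per second

days_needed = math.ceil(TOTAL_PAGES / PAGES_PER_DAY)
review_minutes_per_day = PAGES_PER_DAY * DEPIPES_PER_PAGE * SECONDS_PER_REVIEW / 60

print(days_needed)                       # 59
print(round(review_minutes_per_day, 1))  # 56.7
```

So the run would take about two months at that pace, and the daily review load falls inside the quoted 50-60 minute range, provided the two rough estimates hold.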

I think this is a great idea, but there are a few of its proposed tasks that I would have a hard time being comfortable with. People do ignore all rules on disambiguation pages, and even though fixing piping, for example, might make the page conform to the manual of style for disambiguation pages, it might make the entry really confusing. If we can work out some kind of editor review, perhaps that will work, but for now, I'd almost rather have the bot mark the pages for cleanup so that an editor can do the cleanup. A few specific concerns:
 * If it is going to remove bolding from the beginning of lines, it would need to not un-bold the introductory line and any subsequent section headers.
 * There is some legitimate piping that goes on on disambiguation pages; when the article is a book title or album title or anything that should be italicized, the link will be piped like Title (novel). There are always some random occurrences where it makes sense, too.
 * For the removal of extra links from the lines, I feel like there are enough weird ways that people put disambiguation pages together that just removing the links without looking at it could be harmful. Sometimes people will put two valid links in the same line. Now, that definitely needs to be fixed, but removing one of the blue links sort of does actually take away from the information. This is another case where unless there's a way to work it out, I'd almost rather the bot just tag the page for cleanup, or mark it on one of your subpages as a page with excess links (or something).
That may have sounded sort of negative, but I really am excited about this! A bot to do some disambiguation page cleanup could really help make cleaning up the pages easier for editors. I just want to try and cover all the bases, so it does the least amount of harm. Thank you for writing this bot up! -- Nataly a 11:34, 24 June 2008 (UTC)
 * Thanks for your input and encouragement. I think a couple of the points you bring up are already covered by the code --
 * The bot shouldn't mess with introductory lines, unless they begin with a * or #. (Hopefully these will be rare.)
 * Xyzzyplugh brought up the italics issue on my talk page; I've updated the code so it should preserve italics in DAB items (but not bold).
 * Sorry if these points weren't clear in the function details -- I thought I would bore everyone if I included too much detail. The other points you raise are accurate. — xDanielx  T/C\R 22:35, 24 June 2008 (UTC)


 * Cool - thanks for that information. Glad those issues will work out well. -- Nataly a  10:57, 25 June 2008 (UTC)

Please use the bot to mark pages for cleanup, rather than fixing them. I have been doing disambig cleanup for a month or two now, and my guideline has been exactly the algorithm outlined above. I need to do something other than the guideline suggests on at least 1 in 10 pages (for a total of 10,000 horrible mistakes if the bot is run as proposed). However, whenever I have found a disambig page so ... unique ... that I didn't feel I could handle it, I marked it with disambig-cleanup, and someone always fixed it to perfection within a day. I would definitely support a disambig-routine-cleanup template and category to be added to pages where the bot thinks it could do it automatically, and have the bot add disambig-cleanup if there are weirder problems. Once the bot has classified all 100,000 disambig pages, then we can decide if the Disambig WikiProject can handle the backlog manually or if we need a bot. WikiProject Stub Sorting eats 10,000 stubs for lunch (ok, takes a week or so), so I think it should not be too hard to finish the cleanup using humans. I find the routine disambig cleanup to be extremely relaxing. JackSchmidt (talk) 15:44, 24 June 2008 (UTC)


 * (Response to both Natalya's and JackSchmidt's concerns.) First of all, I believe it would be totally pointless to have a bot which did nothing but mark pages for cleanup. At the moment, we have at least 60,000 (out of 100,000) disambiguation pages which require cleanup. What would be the possible use of having a bot mark 60% of pages for cleanup? The way to clean up DAB pages right now (without a bot) is to simply pick a spot in the alphabet, start opening up pages, and the majority of them will need to be fixed in one way or another. There is no way to make this more efficient if done by hand; a bot marking many of these pages (it couldn't flag them all for cleanup) would be a total waste of time.


 * As to the removal of extra blue links, I have stated elsewhere and will state here how I believe this could be done by a bot in a manner which would be accurate probably 99.9% of the time. If the page is Voodoo, for example, and if an entry begins with a blue linked term which contains the word "voodoo", and if there are other linked terms in the entry, but none of them contain the word "voodoo", then the other terms can safely be de-linked.  A bot which did nothing but this would end up removing unwanted blue links from tens of thousands of DAB pages, and might possibly inappropriately de-link something one time in 1000 or 5000 or whatever.


 * As to piping, I agree that we shouldn't just have the bot de-pipe the blue linked term every time, as there will be many occasions where the piping is appropriate. However, I can easily state a set of rules which would allow the bot to remove probably 95% of inappropriate piping, while virtually never unpiping something that should have stayed piped.  Basically, almost all inappropriate piping consists of either turning something like Eugene, Oregon to Eugene, or turning Voodoo (album) to Voodoo .  So, if you simply have the bot look for piping where the blue linked term contains the page title, and where piping is being used to remove text after a comma or to remove text surrounded by parentheses, you can safely have the bot remove piping (the bot could also easily make sure to not mess up italics, so that Voodoo (album) would stay that way). I don't know how many pages contain inappropriate piping that the bot would be fixing, but I'd guess 10,000+.  Once again, I think following these rules the bot would only be messing something up one time in thousands, which is an acceptable rate of error.  --Xyzzyplugh (talk) 19:15, 24 June 2008 (UTC)
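The two rules proposed above (de-link extra links only when the leading link contains the page title and the others do not; un-pipe only when the pipe merely hides a comma or parenthetical qualifier) could be sketched roughly as follows. The function names are invented for this illustration, the code assumes the leading link has already been confirmed to be a blue link, and the quoted accuracy figures are the commenter's estimates, not something this sketch demonstrates.

```python
import re

# Matches [[target]] or [[target|label]]; names here are hypothetical.
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]*))?\]\]")


def extra_links_can_be_delinked(page_title, item):
    """True if the item starts with a link containing the page title
    and none of the remaining links contain it."""
    links = LINK.findall(item)
    starts_with_link = item.lstrip("*# '").startswith("[[")
    if len(links) < 2 or not starts_with_link:
        return False
    title = page_title.lower()
    first_target, *others = [target.lower() for target, _label in links]
    return title in first_target and not any(title in t for t in others)


def pipe_only_hides_qualifier(page_title, target, label):
    """True if the label is the target minus a ", ..." or " (...)" suffix."""
    if page_title.lower() not in target.lower():
        return False
    without_parens = re.sub(r"\s*\([^()]*\)$", "", target)
    without_comma = target.split(",", 1)[0]
    return label != target and label in (without_parens, without_comma)


print(extra_links_can_be_delinked(
    "Voodoo", "* [[Voodoo (album)]], an album by [[Some Artist]]"))
print(pipe_only_hides_qualifier("Eugene", "Eugene, Oregon", "Eugene"))
```

Keeping each rule as a yes/no predicate means the decision to edit stays separate from the edit itself, so borderline items can be sent to a review list instead.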

A more conservative proposal
Many valid concerns have been expressed here, but I'd really like to avoid making all (or most) edits manually if possible. I propose that we let the bot make the edits, but keep a detailed log of all changes made. The runs would be kept relatively short, and a new run wouldn't be started until all the old logs had been cleared.

My motivation for suggesting this is it would save us a lot of clicking around, waiting for pages to load, and scanning for particular lines. With the log review method, it would just take a few keystrokes: shift, down, delete. (Or shift, down, down, down, down, delete.)

It wouldn't be much trouble for me to keep track of how many changes of each type the bot makes, and how many are reversed upon human review. Once we had a good-sized sample to look at, we could request that certain types of changes be excluded from the logs to make review easier. (I'm not sure how this would work logistically if this RFBA was inactive -- perhaps we could post a request to Wikipedia talk:Bots/Requests for approval?) Even if we had to keep reviewing logs of all types, I think it would be a big time saver.

Of course, complex issues (like external links, or DAB items with multiple sentences) would still just be listed at User:FuBot/Potential atrocities as originally planned -- the bot wouldn't attempt to fix these since the success rate would presumably be too low.

— xDanielx  T/C\R 23:53, 24 June 2008 (UTC)
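One way the per-type bookkeeping described in this proposal might be organized is sketched below. The `ChangeLog` class and its method names are hypothetical, invented only to show the idea of counting changes made versus changes reversed on review.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ChangeLog:
    """Hypothetical log of bot changes, tallied by change type."""
    entries: list = field(default_factory=list)
    made: Counter = field(default_factory=Counter)
    reversed_: Counter = field(default_factory=Counter)

    def record(self, page, kind, before, after):
        # Keep the before/after lines so a reviewer can scan them in a text editor.
        self.entries.append((page, kind, before, after))
        self.made[kind] += 1

    def mark_reversed(self, kind):
        # Called when a human reviewer undoes a logged change of this type.
        self.reversed_[kind] += 1

    def reversal_rate(self, kind):
        return self.reversed_[kind] / self.made[kind] if self.made[kind] else 0.0


log = ChangeLog()
for n in range(4):
    log.record(f"Page {n}", "depipe", "* [[Foo (band)|Foo]]", "* [[Foo (band)]]")
log.mark_reversed("depipe")
print(log.reversal_rate("depipe"))  # 0.25
```

A change type whose reversal rate stays near zero over a good-sized sample would be a natural candidate for dropping from the logs, as suggested above.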
 * I definitely like the new proposal (reviewing runs before more runs are made). I would like to make a unique suggestion. Since there are indeed so many different (and unforeseen) cases of disambiguation pages, I would like to suggest that we create a disambiguation page obstacle course for this bot to test run against as one of its first runs. xDanielx, could you create a page where people add complex disambiguation pages to a list and then once the list is sufficiently filled out, you could run against it? I already have a few pages in mind. That way maybe we could knock out a lot of the bugs right off the bat. Kaldari (talk) 23:23, 25 June 2008 (UTC)
 * Hm, interesting idea. I started User:fuBot/Obstacle course for this. — xDanielx  T/C\R 23:39, 25 June 2008 (UTC)
 * That seems like a good (and sort of fun!) idea. As for the general proposal, I like the idea of some sort of tracking method; as long as it can be relatively easy for a human to go through the bot's edits and make sure nothing weird happened (and if it did, fix it), the weird occurrences don't worry me as much. At least for me, if I get a list of things that need checking/fixing, I don't mind taking a look at it, and knowing that the bot has conceptually already done a lot of the work makes me actually sort of look forward to going back through any changes it wasn't sure about and reviewing. -- Nataly a  01:13, 26 June 2008 (UTC)

Oh! Do you think that you could get it to remove dashes and replace them with commas? Like, change:
 * Foo - a type of plant
to
 * Foo, a type of plant
Could we run into weird problems with this, or would it be pretty innocuous (like removing periods)? Also, what about removing indented bullets? As I look through all the sad disambiguation pages in the cleanup category, I keep seeing things that it seems like the bot might be able to do. I hope I'm not throwing too many things out there! I'm just excited. :) And, do you think it would be able to detect (and not fix, but like, add to a list for a human to go through and fix) when a page located at "Term (disambiguation)" didn't have a primary topic link? -- Nataly a  01:15, 26 June 2008 (UTC)
 * Good idea -- I think it'd be mostly innocuous as long as it only made changes where a dash followed a link at the beginning of an item (i.e., as long as it didn't mess with dashes elsewhere). Though it may create some stylistic inconsistencies by changing a page from dashes only to a mixture of dashes and commas... not a major problem but something to consider. About missing primary topic link -- it would be hard for the bot to detect the links consistently, but I guess I could write something to identify some of them using common patterns (like "[A/An/The] ___ is a/an/the ___"). Couldn't hurt, anyway. — xDanielx  T/C\R 02:43, 26 June 2008 (UTC)
 * For the primary topic link, would it be possible for the bot to detect if there's a link to "Term" on the first couple of lines (taking into account wiktionary links) on a page "Term (disambiguation)"? I don't know if that would make it easier for the bot to do it (My minimal programming experience may just allow me to make really dumb comments!). -- Nataly a  11:18, 26 June 2008 (UTC)
 * Good idea -- that'd probably be more consistent (and simpler) than what I was thinking of implementing above. — xDanielx  T/C\R 21:50, 26 June 2008 (UTC)
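Both ideas from this exchange are simple enough to sketch. The code below is an illustration under stated assumptions, not the bot's implementation: the function names are invented, the dash rule only fires when the dash directly follows a link at the start of an item, and the primary-topic check simply looks for a link to "Term" in the first few non-blank, non-template lines (skipping template lines is how this sketch steps over a wiktionary box).

```python
import re

# Hypothetical pattern: a bullet, an optionally italicized leading link,
# then a hyphen, en dash or em dash surrounded by spaces.
LEADING_LINK_DASH = re.compile(
    r"^(\*+\s*(?:'')?\[\[[^\]]+\]\](?:'')?)\s+[-\u2013\u2014]\s+"
)


def dash_to_comma(item):
    """Turn '* [[Foo]] - a plant' into '* [[Foo]], a plant'."""
    return LEADING_LINK_DASH.sub(r"\1, ", item, count=1)


def missing_primary_topic_link(page_title, wikitext, max_lines=3):
    """True if a 'Term (disambiguation)' page has no link to 'Term' near the top."""
    suffix = " (disambiguation)"
    if not page_title.endswith(suffix):
        return False  # the check only applies to "Term (disambiguation)" pages
    term = page_title[: -len(suffix)].lower()
    lines = [
        line for line in wikitext.splitlines()
        if line.strip() and not line.lstrip().startswith("{{")
    ]
    for line in lines[:max_lines]:
        for target in re.findall(r"\[\[([^\]|#]+)", line):
            if target.strip().lower() == term:
                return False
    return True


print(dash_to_comma("* [[Foo]] - a type of plant"))
```

Pages for which `missing_primary_topic_link` returns `True` would only be listed for a human to look at, in line with the suggestion to detect rather than fix.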

DAB pages are, in general, a mess, and I'm a big fan of coaxing them into following our neglected DABMOS. In the past I've wracked my brain trying to figure out ways of using a bot to fix dabs, but when I tested them all my ideas broke pages as often as they fixed them. If you can get something to work, more power to you. I'd strongly support allowing a small test run (e.g. 25 pages), since this will, more than anything, help xDanielx get a feel for the sort of vagaries and inconsistencies that the bot will have to deal with. (You may have to revert, recode, and retry often, I'm afraid.) – Quadell (talk) (random) 14:07, 6 July 2008 (UTC)

Name
Without expressing an opinion on whether the bot should be approved, could you change its unfortunate name? People might not take it very well when they're reverted by the Eff You Bot.  r speer  / ɹəəds ɹ  05:28, 6 July 2008 (UTC)
 * This bot is going to be reverting editors? BJ Talk 05:29, 6 July 2008 (UTC)
 * Not sure if you meant that rhetorically, but if not -- the bot wouldn't exactly be reverting, though I suppose stripping links (selectively) could be viewed as undoing editors' work in some sense. — xDanielx  T/C\R 06:56, 6 July 2008 (UTC)
 * I meant it as an allusion to FUBAR, but I can see how editors might interpret it that way, especially given the sonant association. Perhaps I could just make it more clear with a note on User:fuBot? (I'm fuBot, the Effed Up Bot.) — xDanielx  T/C\R 06:56, 6 July 2008 (UTC)
 * I have to chime in with advocating renaming. Disambot? xBotx? It's just that people are often silly about such things. – Quadell (talk) (random) 14:06, 6 July 2008 (UTC)
 * Disambot is kind of fun. And not a bad idea to make it clear what the bot is doing (or, Quadell's other suggestion mimicking xDanielx's name). But, xDanielx, do make sure it's something that you still like, as we heap suggestions on you. :) -- Nataly a  16:45, 6 July 2008 (UTC)
 * Disambot sounds great. :-) I'll switch to that -- thanks, Quadell. — xDanielx  T/C\R 22:14, 6 July 2008 (UTC)

Current status
Seems to me this should be approved for a trial of 50 or so, so we can look through the changes and see how well it works. Are there any objections to this? – Quadell (talk) 13:09, 14 July 2008 (UTC)
 * The BRFA hasn't been renamed to reflect the rename, I was waiting on that. BJ Talk 13:18, 15 July 2008 (UTC)
 * Sorry; done.
 * Unfortunately I'll be away between the 16th and the 26th, so I may not be able to complete the trial edits in a timely manner. I look forward to doing so upon returning, though. — xDanielx  T/C\R 17:02, 15 July 2008 (UTC)

Wanna give him a trial approval as a homecoming present? – Quadell (talk) 19:05, 23 July 2008 (UTC)


 * "There's a bit more, but I think you get the idea." - Um, no. Please provide all the changes it will make. I would not be at all comfortable approving a bot if I only knew a small subset of the changes it would be making, especially to pages in mainspace. During coding some may seem obvious and mundane, but they may turn out to be controversial or prone to false positives. Mr.Z-man 13:15, 27 July 2008 (UTC)


 * A bot like this takes in a wide and varied array of contradictory and non-standard styles, and tries to improve them however possible. In these cases, a small trial seems like the best way to figure out what changes need to be made. (xDanielx, if you need to experiment with real data before you can come up with a solid set of changes, you might want to copy random DAB pages into subpages of your userpage, and run those. You can't cause too many problems that way; that's what I've been doing for Polbot 8, which similarly tries to improve various non-standard elements in articles.) – Quadell (talk) 16:28, 27 July 2008 (UTC)

We're going to need more details about which changes this bot will make. Be as specific as you can. – Quadell (talk) 14:36, 6 August 2008 (UTC)


 * Sorry for the slow response. I think I listed all the changes it makes in the affirmative sense. I just left out some restraints -- for instance, the bot will skip very short lines, and it will not remove punctuation when it recognizes an abbreviation (e.g., "John Doe is the CEO of SomeBusiness Corp."). It also may change whitespace, but not in a way that affects the (post-rendering) visual output. I posted the source at User:Disambot/Source if anyone wants to inspect it.
 * Unfortunately time constraints are becoming an issue for me. I was hoping to do this over the summer, but I'll be going to college in ~10 days now. Would anyone be interested in taking over the role of operator? — xDanielx  T/C\R 06:36, 7 August 2008 (UTC)
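The restraints mentioned above (skipping very short lines and leaving punctuation alone after a recognized abbreviation) might look something like this in outline. The length threshold and the abbreviation list are assumptions made for this sketch; the discussion does not say what values the posted source actually uses.

```python
MIN_ITEM_LENGTH = 10  # assumed threshold; the comment above only says "very short"
KNOWN_ABBREVIATIONS = {"Corp.", "Inc.", "Ltd.", "Co.", "Jr.", "Sr.", "etc."}  # assumed list


def strip_trailing_punctuation(item):
    """Strip trailing punctuation unless the line is very short
    or ends in a recognized abbreviation."""
    text = item.rstrip()
    if len(text) < MIN_ITEM_LENGTH:
        return item  # too short to judge safely; leave untouched
    words = text.split()
    if words and words[-1] in KNOWN_ABBREVIATIONS:
        return item  # the final period belongs to the abbreviation
    return text.rstrip(".,;:")


print(strip_trailing_punctuation("* [[John Doe]], the CEO of SomeBusiness Corp."))
print(strip_trailing_punctuation("* [[Foo]], a type of plant."))
```

A fixed list will inevitably miss some abbreviations, which is one reason logged changes would still need human review.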


 * Does said person need extensive (or, actually, any) knowledge of bots? I would be glad to help out, but unfortunately I know next to nothing about how they actually work.  -- Nataly a  10:52, 7 August 2008 (UTC)


 * Yeah, I think they will, since the code will need to be modified to accommodate new-found problems and solutions. I'd offer, but I'm really not good with python. – Quadell (talk) 13:49, 7 August 2008 (UTC)


 * I think that would be preferable, but if there aren't interested Python programmers around, it would still be a big help if someone would run the bot at appropriate times and address any non-technical concerns. There's also the issue of reviewing logs -- it seems unclear at this point if the operator will be reviewing all changes or just a handful (as in the original proposal). I hope the BAG member who handles this RFBA will make it clear what's expected. — xDanielx  T/C\R 02:40, 9 August 2008 (UTC)
 * I posted a bot request here per Quadell's suggestion. — xDanielx  T/C\R 03:42, 9 August 2008 (UTC)

If anyone else would like to run this bot, we'll reopen the request. – Quadell (talk) 12:43, 18 August 2008 (UTC)


 * The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.