Wikipedia:Bots/Requests for approval/The Auto-categorizing Robot


 * The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Approved.

The Auto-categorizing Robot
Operator: ThaddeusB

Automatic or Manually assisted: Manually assisted at first, to confirm that unusual page formatting doesn't break it, then fully automatic.

Programming language(s): Perl

Source code available: http://thaddeusb.awardspace.com/NRHP.txt

Function overview: Add NRHP historic district categories to pages with infobox nrhp as per this request.

Edit period(s): Most likely one time

Estimated number of pages affected: ~10-20% of the just under 20,000 pages that use infobox nrhp, or 2,000-4,000 pages.

Exclusion compliant (Y/N): N

Already has a bot flag (Y/N): N

Function details: infobox nrhp currently automatically places pages into Category:Historic districts in the United States if  is set to   or. The National Register of Historic Places Wikiproject wishes to remove this functionality and instead place each HD in its respective State category (e.g. Category:Historic districts in New York). In order to accomplish this, they need a bot to go through & put each HD into a category or else most won't have any HD category at all.

Instead of just writing code to add the generic code to every page, I have written code to further help the project by automatically putting most articles directly into the correct state. The logic is as follows:


 * 1) Load list of pages that use infobox nrhp
 * 2) Is the page a HD? If no, go to next page. If yes, continue.
 * 3) Does the page already contain Category:Historic districts in the United States? If so, temporarily strip it out
 * 4) Does the page contain any other HD category? If so, save without redundant US category (if needed) & go to the next page
 * 5) Try to determine what state the HD is located in by looking at  parameter of the infobox
 * 6) If that fails, try the  parameter of the infobox
 * 7) If that fails, try the text of the article's lead section
 * 8) If all else fails or 2+ states are possible matches, use the generic US category
 * 9) Save & continue with the next page
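The decision steps above can be sketched in a few lines. This is only an illustrative Python sketch, not the bot's actual Perl source: the function names, the abbreviated state list, and the order in which the two infobox parameters are tried (locmapin, then location, as the discussion below suggests) are all assumptions. It also assumes that an ambiguous source stops the search and falls back to the generic category, and it handles overlapping names such as "Virginia" and "West Virginia" by masking the longer name first, which may differ from how the bot counts matches.

```python
import re

# Hypothetical, abbreviated list; a real run would cover every state.
STATES = ["Florida", "Indiana", "Michigan", "New York", "Pennsylvania",
          "Virginia", "Washington", "Washington, D.C.", "West Virginia"]

GENERIC = "Category:Historic districts in the United States"
GENERIC_LINK = "[[" + GENERIC + "]]"
# Any historic-district category other than the generic US one.
OTHER_HD = re.compile(
    r"\[\[Category:Historic districts in (?!the United States[\]|])")


def find_states(text):
    """Return the set of state names mentioned in text.

    Longer names are matched and masked first, so "West Virginia"
    is not also counted as "Virginia".
    """
    found = set()
    remaining = text
    for name in sorted(STATES, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, remaining):
            found.add(name)
            remaining = re.sub(pattern, " ", remaining)
    return found


def choose_category(locmapin, location, lead):
    """Steps 5-8: try each source in turn; accept only a unique match."""
    for source in (locmapin, location, lead):
        states = find_states(source or "")
        if len(states) == 1:
            return "Category:Historic districts in " + states.pop()
        if len(states) > 1:
            # Assumption: two or more candidates means "give up".
            return GENERIC
    return GENERIC


def recategorize(wikitext, locmapin="", location="", lead=""):
    """Steps 3-8 for a page already known to be a historic district."""
    # Step 3: strip a directly placed generic category.
    body = wikitext.replace(GENERIC_LINK + "\n", "").replace(GENERIC_LINK, "")
    # Step 4: a more specific HD category exists, so nothing is added.
    if OTHER_HD.search(body):
        return body
    category = choose_category(locmapin, location, lead)
    return body.rstrip("\n") + "\n[[" + category + "]]\n"
```

Under these assumptions, a lead that mentions both Michigan and a "Florida street" yields two candidates, so the page stays in the nationwide category rather than being guessed into either state.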

It will also make a log of all changes for possible rapid review by humans. The log of what it would do with the first 100 entries can be found here.

The bot has been programmed with the assumption that this discussion will result in the renaming of 4 non-standard categories. If for some reason that doesn't happen I will have to modify the code slightly.

Discussion
I have tested the first 150 or so results locally and found no issues. However, I plan to run this in "display each change locally before uploading mode" for a while just to be sure it isn't messing up when encountering unusual wiki formatting. --ThaddeusB (talk) 20:15, 17 September 2009 (UTC)
 * Comment I was the one who asked for this bot action after talking with other WP:NRHP members. As far as I can understand the logic, this bot would do all we wanted (and more) without causing any problems.  Because most infoboxes contain the locmapin parameter, and most of those that don't will have the state name in the locmapin parameter, there aren't likely to be many pages in which the bot is forced to look at the text.  Just curious, however: exactly what would the bot use in the lead section?  I just want to avoid the possibility of confusion — for example, if we had an article about a historic district in Indiana, Pennsylvania (which we don't) in which the bot was forced to look at the text, would it sort it into the Indiana statewide category, the Pennsylvania statewide category, or the nationwide category?  No problems if the bot is told that such a situation is an "all else fails" result.  By the way, there are currently 3,119 articles in Category:Historic districts in the United States.  Nyttend (talk) 02:40, 22 September 2009 (UTC)
 * It would go to the nationwide category. As it happens, in the sample run there was a HD in Michigan that happened to be bounded by "Florida street".  As such, the bot could not decide between Michigan & Florida and so put it in the nationwide category. (Of course not literally, since it didn't write the change.) It works by counting the number of different matches and then not adding a state category if the number is not one, except for double matches on both "Virginia" and "West Virginia" or "Washington" and "Washington, D.C." for obvious reasons.  This applies to either the location= parameter or the lead.  It will make a log of each article it scans and note exactly what it did & why, which will be dumped into a sortable table to make actions of any particular type easy to find (by sorting the table). --ThaddeusB (talk) 03:09, 22 September 2009 (UTC)
 * I think this sounds like a very useful bot, and I have no problems with approving it. ··· 日本穣 ? · 投稿  · Talk to Nihonjoe 13:55, 23 September 2009 (UTC)
 * Comment The bot looks pretty straight-forward and useful. I think articles where there is no locmapin or location parameter should be listed on a separate subpage for the NRHP project members to run through and verify after the bot tags them, or, if there are only a few, they could simply be ignored by the bot other than to list on a subpage for the NRHP members. No other problems with this bot or its operator for this particular task. That's my opinion. --69.225.3.119 (talk) 22:49, 23 September 2009 (UTC)

Mr.Z-man 00:48, 24 September 2009 (UTC)


 * Trial's fine with me, but I would like the outstanding issue of what is done about pages without locmapin or location parameters decided. --69.225.3.119 (talk) 01:09, 24 September 2009 (UTC)


 * The bot already outputs the result of each page it loads in a sortable table. If the WikiProject decides it wants the list you described - or any other list - I will be happy to generate said list from the output tables at that time. --ThaddeusB (talk) 13:07, 24 September 2009 (UTC)
 * The original request that I filed at Bot Requests — with which other project members had already agreed — was to have the bot add Cat:HDs in the USA to all of these articles. I'm sure that none of us have a problem with the idea of the bot becoming confused and dumping an article into the nationwide category.  Nyttend (talk) 01:27, 25 September 2009 (UTC)
 * That seems like a workable solution, then these can also be checked by the project members to try to get a state on them, particularly if an output table is generated with all of them. However, the project is on top of this, and however they want to handle it is fine, as they've already raised the issue. --69.225.5.4 (talk) 21:47, 25 September 2009 (UTC)

- Log. The only issue was a typo in the edit summary that caused it to point to a non-existent page, rather than the log. --ThaddeusB (talk) 04:52, 27 September 2009 (UTC)

The trial edits look fine, imo, and the project that requested the bot is monitoring the output, so I don't see any concerns. --69.225.5.4 (talk) 18:20, 27 September 2009 (UTC)
 * Is it intentional to leave the pages in the United States HD category when adding the more specific HD cat? - Kingpin13 (talk) 04:47, 30 September 2009 (UTC)
 * I didn't catch that. It shouldn't be done; that's overcategorizing if the state HD categories are subcats of the US HD cat. --69.225.5.4 (talk) 04:59, 30 September 2009 (UTC)
 * Yuh, I suspect that because the infobox automatically places pages into this category (and that can and will be removed from the template, rather than the pages), ThaddeusB hasn't set the bot up to see if the category is placed directly onto the page. But that's just my guess :) - Kingpin13 (talk) 05:05, 30 September 2009 (UTC)
 * Any direct categorization should be removed like it did here. Do you have an example where it didn't?
 * However, every single page is going to be in the national category since it is added by the infobox. That functionality will be removed from the infobox after the run is complete. --ThaddeusB (talk) 12:01, 30 September 2009 (UTC)
 * Bot seems to have worked just as it should have, except for the log upload failure; is that a major problem? The bot edited plenty of pages where there shouldn't be any historic district category at all, but that's an issue for WP:NRHP to take care of: the problem is that some articles with infobox nrhp shouldn't have them.  It's best to have the bot do like it's doing and not try to determine whether the infobox really belongs there, so (in my mind) this is more evidence that the bot is working well.  Nyttend (talk) 12:41, 30 September 2009 (UTC)
 * The log upload failed due to me using the wrong variable name, which has been corrected. --ThaddeusB (talk) 13:39, 30 September 2009 (UTC)

I was sure that this edit left a direct nation-wide category in, but obviously I must have misread one of the other cats. It appears that this task is wanted/useful, and the bot works great; none of us four seem to have a problem with the actual edits. Good to go - Kingpin13 (talk) 16:40, 30 September 2009 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.