Wikipedia:Bots/Requests for approval/BoxCrawler


 * The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Approved.

BoxCrawler
Operator: Alanbly

Automatic or Manually Assisted: Automatic, Semi-Supervised

Programming Language(s): Python (Using slightly modified [ask if you want details] pywikipediabot framework)

Function Summary: Adding correct "needs-infobox" parameter to talk pages of articles using WPSchools banner

Edit period(s) (e.g. Continuous, daily, one time run): One major run, then sporadic runs at the request of WikiProject Schools

Edit rate requested: There are 7,999 pages in the category, so 5,000 to 10,000 (maximum) edits per run (about twenty per day after the initial run). Will be limited to 6 edits (15 accesses) per minute.

Already has a bot flag (Y/N): N

Function Details: This bot will visit every page in Category:WikiProject Schools_articles_by_quality and its subcategories (those with the WPSchools banner). It will examine the corresponding article for each in the mainspace (if there is one) and determine whether each has an infobox. The bot will then place the correct "needs-infobox" parameter on the WPSchools template (on the talk page). It will only add this parameter if the template is empty or if the rating given corresponds to an article (Stub, Start, B, GA, A, FA). It will also record all infoboxes found (or not found) in a log at User:BoxCrawler/Log.
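The banner update described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the bot's actual code: the function name `set_needs_infobox`, the rating parameter name `class`, the value `no`, and the simplification that the banner contains no nested templates are all hypothetical. Only the `needs-infobox` parameter, the `yes` value, and the list of article ratings come from this request.

```python
import re

# Ratings that correspond to an article, as listed in the function details.
ARTICLE_CLASSES = {"stub", "start", "b", "ga", "a", "fa"}

# Assumption: the banner is written as {{WPSchools|...}} with no nested templates.
BANNER = re.compile(r"\{\{\s*WPSchools\s*(\|[^{}]*)?\}\}", re.IGNORECASE)


def set_needs_infobox(talk_text, has_infobox):
    """Return talk-page wikitext with needs-infobox set on the banner.

    The text is returned unchanged when there is no banner, or when the
    banner has parameters but its rating is not an article rating.
    """
    match = BANNER.search(talk_text)
    if match is None:
        return talk_text

    # Split the parameter list, dropping empty pieces.
    params = [p.strip() for p in (match.group(1) or "").split("|") if p.strip()]

    # Collect named parameters; "class" as the rating name is an assumption.
    named = {}
    for param in params:
        if "=" in param:
            key, value = param.split("=", 1)
            named[key.strip().lower()] = value.strip()

    rating = named.get("class", "").lower()
    if params and rating not in ARTICLE_CLASSES:
        return talk_text  # non-empty banner with a non-article rating

    # Drop any existing needs-infobox value, then append the new one.
    kept = [p for p in params if p.split("=", 1)[0].strip() != "needs-infobox"]
    kept.append("needs-infobox=" + ("no" if has_infobox else "yes"))

    banner = "{{WPSchools|" + "|".join(kept) + "}}"
    return talk_text[:match.start()] + banner + talk_text[match.end():]
```

For example, `set_needs_infobox("{{WPSchools}}", False)` fills in an empty banner, while a banner rated with a non-article class is left alone.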

Discussion
Sounds fine to me. How many edits per minute are you going to limit it to? --Android Mouse 04:09, 27 June 2007 (UTC)
 * I forgot to say, it's generally advised that you have the word 'bot' somewhere in the bot's username. Also, Alanbly can you edit this page to confirm it is you? --Android Mouse 04:12, 27 June 2007 (UTC)
 * "Crawler" sounds bot-like enough. Max S em 09:17, 27 June 2007 (UTC)


 * This is indeed me, and I preferred Crawler to Bot. As far as edit frequency, I'm not sure. As this is my first bot and it is purpose built, I don't really have a benchmark. Could you point me to one? Adam McCormick 13:42, 27 June 2007 (UTC)
 * I guess crawler sounds bot-like enough, not sure what I was thinking. 5 to 15 or so edits per minute is generally acceptable. --Android Mouse 17:25, 27 June 2007 (UTC)


 * We need this bot as the project has had problems with old templates reporting that there is an infobox when there is not (or vice versa). Obviously, erroneous judgements undermine any correct assessments that are on the same template. One concern is that we may be putting "needs infobox" on very small stubs. I suggest that if they are small, then also set them to "stub" and importance="low" (let's save a human from evaluating pages that have only twenty to sixty words on them). Oh, and rate ... there are < 8,000 articles that have this template at present. Victuallers 18:30, 27 June 2007 (UTC)


 * I was not aware that you could create bots, Adam; it looks like a nice piece of work. The WikiProject Schools Assessment Team do a lot of repetitive stuff manually, so getting a bot would be a great help given the number of school articles there are! I do also like the idea of advancing this bot later to give obvious assessments automatically. Camaron1 | Chris 18:40, 27 June 2007 (UTC)


 * OK, what kind of infobox detecting regexps will it use? I suppose that most have the word "infobox" in them, I'd assume.  Voice -of- All  00:05, 4 July 2007 (UTC)
 * Since you ask (and from that I'll assume you know something of the) I use the following (Applied to the wikitext of the page):

getInfobox = re.compile(r'{{[Ii]nfobox[\s_]+[^\|}:]+|({\||<div)\s*class="[^"]*infobox[^"]*"')
 * Basically anything with "{{infobox" or "{{Infobox" followed by a space, then some characters that aren't any of "\|}:" Adam McCormick 16:39, 4 July 2007 (UTC)
 * The only issue I can see with this strategy is that it doesn't work for subst:'d templates, but I can't see a way to fix this. Adam McCormick 23:51, 5 July 2007 (UTC)
 * I've added an extra case for div boxes that use class="infobox"; maybe this will increase the catchment. So far it's working all right. Adam McCormick 06:04, 8 July 2007 (UTC)
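To make the behaviour of the pattern quoted above concrete, here is a small sketch that applies it to a few made-up wikitext fragments. The `has_infobox` helper and the sample strings are hypothetical and only for illustration; the regular expression itself is copied unchanged from the comment above.

```python
import re

# Pattern copied verbatim from the discussion above.
getInfobox = re.compile(r'{{[Ii]nfobox[\s_]+[^\|}:]+|({\||<div)\s*class="[^"]*infobox[^"]*"')


def has_infobox(wikitext):
    """Hypothetical helper: True if the pattern matches anywhere in the text."""
    return getInfobox.search(wikitext) is not None


# Made-up fragments, not taken from any real article.
samples = [
    "{{Infobox School|name=Example High}}",              # template, space in name
    "{{Infobox_School|name=Example High}}",              # template, underscore in name
    '{| class="infobox" style="float:right"',            # table with class="infobox"
    '<div class="infobox bordered">',                    # div with class="infobox"
    '{| class="wikitable"\n| Principal || Jane Doe\n|}',  # hard-coded plain table
]

for text in samples:
    print(has_infobox(text), repr(text))
```

With these inputs the first four fragments match and the plain wikitable does not, which is consistent with hard-coded tables being reported as needing an infobox. A subst'd template leaves no "{{Infobox" call in the wikitext, so it can only be caught if its expanded markup happens to carry class="infobox".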


 * {{BotTrial}} (POC) I think we can get a better idea of how this will run with a proof of concept. If you are ready to run, can you run it for, say, 25 edits and let us see the results? Additional questions above should still be addressed, and additional questions may appear once testing has begun. —  xaosflux  Talk 14:02, 4 July 2007 (UTC)
 * I'm having trouble with the login, but as for proof of concept (or at least proof that the bot correctly edits the template), the edits have gone through under 138.67.78.236. Adam McCormick 00:00, 6 July 2007 (UTC)
 * Looks like I've got the glitch worked out. I'm going to run the 25 edits and I'll post here (and at your talk page) when they are complete Adam McCormick 05:38, 8 July 2007 (UTC)
 * Test Run is complete, edits may be seen here Adam McCormick 06:00, 8 July 2007 (UTC)

Test run review
A random sampling of 20 edits is showing a high error rate. See below for possible errors:
 * 1) Vestal Senior High School (has infobox)
 ** Has what looks like an infobox but is really just a table
 * 2) Talk:Upper St. Clair High School (removed someone's assessment that it already has an infobox, and it does)
 ** Infobox template name used an underscore; I'm adding that to my regex
 * 3) University of Louisville (already has infobox)
 ** Same as above, regex
 * 4) Trinity High School (Euless, Texas) (already has infobox)
 ** Same as above, regex
 * 5) Staples High School (already has infobox)
 ** Once again, only looks like an infobox
 * 6) Presbyterian Ladies' College, Sydney (already has infobox)
 ** Same thing, not actually an infobox
 * It looks like it may be missing certain types of infoboxes, especially ones that have been subst'd. —  xaosflux  Talk 16:48, 8 July 2007 (UTC)
 * It was missing infoboxes if the template name was written with an underscore rather than a space; this has now been fixed. As for the others, they do need infoboxes, as they are currently hard-coded tables and we want to standardize the use of infoboxes as templates (so that each logical grouping of schools looks approximately alike). I'm not sure that it's possible to check for subst'd infoboxes unless I hard-code in every infobox on the wiki, and that seems a little extreme. I can do this if it's the only way to run the bot, though. Adam McCormick 20:03, 8 July 2007 (UTC)
 * What if needing an infobox adds the page to a category? If these articles with subst'ed infoboxes get categorized, whoever goes through the category can transclude an infobox onto the page. Adding them to the category will draw necessary attention to them. What do you think?  W ODU P  20:36, 8 July 2007 (UTC)
 * Putting the "needs-infobox=yes" parameter on the banner already adds a page to Category:WikiProject Schools articles needing infoboxes. Adam McCormick 20:46, 8 July 2007 (UTC)
 * So, I don't see that adding articles with subst'ed infoboxes to that category (as it apparently does now) would be a problem.  W ODU P  21:07, 8 July 2007 (UTC)
 * Many of these look to the casual observer (e.g. myself) to be infoboxes; although they are not "infoboxes", they certainly are collections of facts, in a box, in the normal infobox location. Can you link to the project or other discussion where consensus on which particular type of infobox to use has been decided for this entire category of pages? —  xaosflux  Talk 22:49, 8 July 2007 (UTC)
 * The listing of which infoboxes to use is here and here, and there is ongoing discussion about condensing/homogenizing infoboxes. Some examples are here, here, and here. In many ways, a bot is much more suited to determining whether these are or are not infoboxes. I will ask the other folks from the project to weigh in if this isn't enough. Adam McCormick 23:39, 8 July 2007 (UTC)
 * You may re-trial up to 50 edits if you think you got the last bugs out. Thanks, —  xaosflux  Talk 00:45, 11 July 2007 (UTC)
 * Alright, I'm rerunning the trial Adam McCormick 01:07, 12 July 2007 (UTC)
 * Trial is complete and the results can be seen here. Note that the SG (Singapore) template was recently changed so the pages found first are those with this template. Adam McCormick 04:22, 12 July 2007 (UTC)

Actually, anything matching class[ ]*\=[ ]*(?:infobox|[\"\'][^\"\']*infobox[^\"\']*[\"\']) should probably be marked with newinfobox whether it is a school or not. —freak(talk) 01:41, 18 July 2007 (UTC)


 * I didn't know about that template; I may add it now that I do. I was hoping to get approved for just checking for infoboxes before I tried to add any more spiffy functionality (like editing mainspace). Thanks. Adam McCormick 04:26, 18 July 2007 (UTC)
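As a rough illustration of what the class-based pattern suggested above would catch, the sketch below compiles it in Python and tries it on a few made-up fragments. The variable name `classInfobox` and the sample strings are hypothetical; the pattern is copied from the suggestion as written. It shows only the matching step, not any tagging of pages.

```python
import re

# Pattern copied from the suggestion above (raw string, triple-quoted so that
# both quote characters can appear inside it).
classInfobox = re.compile(
    r'''class[ ]*\=[ ]*(?:infobox|[\"\'][^\"\']*infobox[^\"\']*[\"\'])'''
)

# Made-up fragments, not taken from any real article.
samples = [
    '{| class=infobox',                  # unquoted class value
    "<table class = 'infobox vcard'>",   # single quotes, spaces around "="
    '<div class="infobox">',             # double quotes
    '{| class="wikitable"',              # unrelated class
]

for text in samples:
    print(classInfobox.search(text) is not None, repr(text))
```

In this sketch the first three fragments match and the wikitable does not, so the suggested pattern would also pick up unquoted and single-quoted class attributes on any element.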

Not to be impatient, but what else needs to happen to get this approved? I'm hesitant to post to every BAG member again, but I will if I need to. Adam McCormick 02:30, 19 July 2007 (UTC)


 * I don't see any obvious errors in your edits. Go, finish your project and let us know if you decide to do something drastically different from your current task, otherwise good luck. If you want a bot flag, only a bureaucrat would be able to assign that, and in case you didn't notice, nobody here is one. FWIW, if you're editing only outside of article space at non-ludicrous speeds, I doubt a bot flag would be necessary or helpful. Have fun. —freak(talk) 06:49, 19 July 2007 (UTC)


 * Thanks, I'll get started. Can you put this into the approved section then, or bots needing tags or such, as I will be extending the functionality once the major task is complete? Adam McCormick 17:01, 19 July 2007 (UTC)


 * The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.