Wikipedia:Bots/Requests for approval/Joe's Olympic Bot


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.

Joe's Olympic Bot
Operator:

Time filed: 05:18, Friday August 31, 2012 (UTC)

Automatic, Supervised, or Manual: supervised

Programming language(s): Perl, MediaWiki API

Source code available:  Will be made available. Recent draft of traversal code, still hacky.

Function overview: Find 2012 Summer Olympic athletes sourced only to "http://www.london2012.com/" (that is, to the home page rather than a deep link) and attempt to replace those links with a deep link to the athlete's profile there, determined by a search of the london2012.com web site with a variety of sanity/consistency checks.

Links to relevant discussions (where appropriate):   Dr. Blofeld did a Bot Request here and didn't get much feedback, later we discussed it separately, and I posted on the Pump here

Edit period(s): One-time run.

Estimated number of pages affected: 1118 (Those marked "BADREF" here)

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): No

Function details:  Traverses the list of athletes of the 2012 Summer Olympics using the appropriate category tree. There are approximately 10,000 such articles.

Looks for articles which include a reference, but only a reference to the home page of the London Olympics (e.g., http://www.london2012.com/). There are roughly 1,100 of these. (In a few cases I may miss an external link that a human might consider a reliable reference, but I'll still only do *anything* to biographies above that include the link to the specific URL given above.)

For those thousand or so articles, I'll query london2012.com's search with the athlete's name, and attempt to retrieve the athlete's profile page from the directory. An example of a URL for such a profile is http://www.london2012.com/athlete/jager-maja-1078183/ for "Maja Jager". If I can find such a URL, and it passes a set of sanity checks, then I'll replace the shallow home page link with the deep link.
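A minimal sketch of the replacement step described above, written in Python rather than the bot's Perl and not taken from the bot's source. The function name `replace_home_link` and the regular expression are illustrative assumptions; the idea is only that a bare home page link is swapped while existing deep links are left alone.

```python
import re

# Hypothetical pattern: the home page URL, with or without a trailing slash,
# followed by whitespace, a closing bracket, a pipe, "<", "}" or end of text.
# A deep link continues with a path segment, so it does not match.
HOME_LINK = re.compile(r"http://www\.london2012\.com/?(?=[\s\]|<}]|$)")


def replace_home_link(wikitext, profile_url):
    """Return (new_wikitext, number_of_links_replaced)."""
    return HOME_LINK.subn(profile_url, wikitext)


if __name__ == "__main__":
    before = "<ref>[http://www.london2012.com/ London 2012]</ref>"
    after, count = replace_home_link(
        before, "http://www.london2012.com/athlete/jager-maja-1078183/"
    )
    print(count, after)
```

Returning the replacement count lets a caller skip the edit entirely when nothing matched, which fits the stated intent of only touching articles that contain the exact home page URL.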

Known issues to be addressed during development if approved for trial:
 * 1) Check that the URL I end up with is of the form listed above: that is, does the URL refer to a resource in the "athlete" directory, starting with the last name of the athlete, ending in a number, and containing the first name of the athlete? (Or something like this, as the data drives the sanity checking.)  Not done in this way; instead, I'm doing comparisons against the name in the profile.
 * 2) Rate-limiting searches to london2012?  Some delays inserted.
 * 3) Appropriate handling of non-ASCII names (The print code that generated the output I uploaded to the BADREF list above obviously flubs this.) Search function seems to find results despite accent differences, woot.
 * 4) Better detection of some reliable templates to sports DBs, such as "iaaf name", as ref-equivalents, that often appear as external links in these articles.  (I'm only aiming to hit the basically unreferenced articles for now, if I want to go fix *all* the london2012.com home links, I'll make a separate task request after this task is complete.)  Mostly complete in this data set.
 * 5) Some searches will fail.  In hand-simulating this task, I've found five articles so far in which the athlete is present in the database, but the name doesn't match, that is, our article has a different spelling than the Olympics database and other sources. The code should handle this now.
 * 6) Exclusion of non-article space entries  (e.g., there's a couple user space files in the list, there shouldn't be, but I need to handle that case.) done.
 * 7) Exclusion of the overview articles (Some of the "list of handballers in the 2012..." articles keep going in and out of the categories I'm traversing. They shouldn't be there, but I need to gracefully deal if they are.) Ignores (and warns) on articles that aren't in the living persons category.
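For illustration, the URL-shape check proposed in item 1, combined with the accent folding that item 3 calls for, might look like the following Python sketch. It is not the bot's Perl code, and item 1 notes that the bot ended up comparing against the name in the profile instead. The slug layout (last name, first name, numeric id) is an assumption inferred from the single example URL above, and `fold`, `slugify` and `url_matches_name` are hypothetical helper names.

```python
import re
import unicodedata

# Assumed profile URL shape: /athlete/<lastname>-<firstname>-<digits>/
PROFILE_URL = re.compile(r"^http://www\.london2012\.com/athlete/([a-z\-]+)-(\d+)/$")


def fold(name):
    """Lowercase a name and strip combining accents (e.g. 'Jäger' -> 'jager')."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def slugify(name):
    """Fold a name and join its alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", fold(name)).strip("-")


def url_matches_name(url, first, last):
    """True if the URL is a profile link whose slug starts with the last
    name and contains the first name after it."""
    m = PROFILE_URL.match(url)
    if not m:
        return False
    slug = m.group(1)
    first_slug, last_slug = slugify(first), slugify(last)
    if not first_slug or not last_slug:
        return False
    return slug.startswith(last_slug + "-") and first_slug in slug[len(last_slug):]


if __name__ == "__main__":
    url = "http://www.london2012.com/athlete/jager-maja-1078183/"
    print(url_matches_name(url, "Maja", "Jäger"))
```

A check like this is deliberately conservative: a failed match means the article is skipped and left for manual review, not that an edit is attempted with a weaker match.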

Discussion
Note: Whoops, I created the User/User Talk page for the linked account using the account, not my own. Bad editor, no cookie--sorry about that. --j⚛e deckertalk 16:07, 31 August 2012 (UTC)
 * I left a message on my talk page, but it may be useful for you to use CLO it is used like
 * Happy to take advantage of that (both in any hand-work I do, and in any approved bot use). --j⚛e deckertalk 03:04, 2 September 2012 (UTC)
 * Oh, and it should be substituted. Ryan Vesey 03:08, 2 September 2012 (UTC)

 MBisanz  talk 00:36, 11 September 2012 (UTC)


 * Comments/Results of Trial The trial was conducted in small batches of increasing size, and some development work was done during the process as problems were observed.  An early error (e.g., Felismina Cavela) exposed a problem in which some of the checks to ensure the profile matched the article weren't working: although notices were generated for me, the edits were still made. The code handled that type of case correctly after a code fix.
 * A later error at Jeff Hunt (athlete) uncovered a problem in how I created the ref's title in the presence of a disambiguation parenthetical. I reverted/reran that case, and my code functioned as expected the second time.
 * The code is still not clever about matching name-order mismatches (Foo Bar vs. Bar Foo, something that comes up a lot with a variety of Asian athletes, where English coverage is often mixed as to the proper name order). This isn't a problem per se; it just means these are flagged for my attention and not edited by the bot code.  I don't plan to improve this case, as I think hand-attention is probably a bit safer.
 * The code still has some work to do on matching in the presence of accents (The London2012 site isn't consistent about whether accents are preserved in athlete names.)  I do plan on improving it, but right now the issue is simply that the code doesn't make a handful of edits it probably could be allowed to do.  This is a case where I think some improvements will be simple and beneficial.
 * If allowed to continue, my plan is basically to keep slowly ramping up batches in a watchful, semi-automated way until the remaining thousand or so articles are handled.
 * Any comments/issues/concerns/feedback welcome. Thank you!  --j⚛e deckertalk 16:14, 11 September 2012 (UTC)
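The two trial observations above (the disambiguation parenthetical in a ref title, and name-order mismatches being flagged instead of edited) can be illustrated with a short Python sketch. This is an assumption-laden illustration and not the bot's Perl code; `display_name`, `classify` and the result labels are hypothetical.

```python
import re


def display_name(title):
    """Drop a trailing disambiguation parenthetical from an article title,
    e.g. 'Jeff Hunt (athlete)' -> 'Jeff Hunt'."""
    return re.sub(r"\s*\([^()]*\)\s*$", "", title)


def classify(article_title, profile_name):
    """Compare an article title with a profile name.

    Returns 'match' when the words agree in order, 'flag-name-order' when
    the same words appear in a different order (left for a human to check),
    and 'mismatch' otherwise.
    """
    article_words = display_name(article_title).casefold().split()
    profile_words = profile_name.casefold().split()
    if article_words == profile_words:
        return "match"
    if sorted(article_words) == sorted(profile_words):
        return "flag-name-order"
    return "mismatch"


if __name__ == "__main__":
    print(classify("Jeff Hunt (athlete)", "Jeff Hunt"))
    print(classify("Foo Bar", "Bar Foo"))
```

Only the 'match' outcome would lead to an automated edit in this sketch; the reordered case is reported and skipped, mirroring the hand-attention approach described in the trial notes.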


 * Looks good, and clearly a trustworthy op with a good operational plan and an enviable caution. - Jarry1250 [Deliberation needed] 21:24, 12 September 2012 (UTC)


 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.