Wikipedia:Bots/Requests for approval/Chartbot


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.

chartbot
Operator:

Time filed: 16:05, Friday February 22, 2013 (UTC)

Automatic, Supervised, or Manual: Automatic

Programming language(s): PHP

Source code available:

Function overview: Repair links to Billboard charts that were broken in the last revamp of Billboard's site, and convert the (extremely few) links to the new-format site into templates, so that a future revamp does not break them in the same way

Links to relevant discussions (where appropriate):

Edit period(s): Periodic

Estimated number of pages affected: 4428

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): No

Function details: The bot looks for URLs in the following formats:
 * 1) http://www.billboard.com/artist/artist name, urlencoded/chart-history/magic number?f=chart number&g=Singles
 * 2) http://www.billboard.com/artist/artist name, urlencoded/chart-history/magic number?f=chart number&g=Albums
 * 3) http://www.billboard.com/artist/artist name, urlencoded/chart-history/magic number
 * 4) http://www.billboard.com/artist/magic number/artist name, urlencoded/chart?f=chart number
 * 5) http://www.billboard.com/artist/magic number/artist name, urlencoded/chart

Formats 1-3 can also have "/#" inserted at nearly random points in the URL. Billboard's old site really only used the number, but did some verification of the surrounding text to foil automated linking. This means they were sloppy about what they presented as the URL field, and this sloppiness was cut-and-pasted into the links.

The first three formats represent links to Billboard's old charts: they are all dead links, landing users on a 404 page. The latter two formats are links to the newer site. They work, but contain hardcoded numbers that will break the next time Billboard revamps.
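The five formats can be sketched as two regular expressions, one for the old site and one for the new. This is an illustration in Python, not the bot's PHP source. It assumes that the magic number and chart number are plain digits and that removing every "/#" is enough to undo the stray insertions; the names `OLD_RE`, `NEW_RE` and `classify` are hypothetical.

```python
import re
from urllib.parse import unquote

# Formats 1-3 (old site): /artist/<name>/chart-history/<magic>[?f=<chart>[&g=Singles|Albums]]
OLD_RE = re.compile(
    r"^http://www\.billboard\.com/artist/(?P<name>[^/]+)/chart-history/(?P<magic>\d+)"
    r"(?:\?f=(?P<chart>\d+)(?:&g=(?:Singles|Albums))?)?$"
)

# Formats 4-5 (new site): /artist/<magic>/<name>/chart[?f=<chart>]
NEW_RE = re.compile(
    r"^http://www\.billboard\.com/artist/(?P<magic>\d+)/(?P<name>[^/]+)/chart"
    r"(?:\?f=(?P<chart>\d+))?$"
)


def classify(url):
    """Return (site, artist_name, chart_number) or None if no format matches.

    chart_number is None for formats 3 and 5, which carry no f= parameter.
    """
    # Assumption: stray "/#" fragments can simply be deleted before matching.
    cleaned = url.replace("/#", "")
    for site, pattern in (("old", OLD_RE), ("new", NEW_RE)):
        match = pattern.match(cleaned)
        if match:
            return site, unquote(match.group("name")), match.group("chart")
    return None
```

Anything that returns `None` here, such as a news article or a link to a specific week of a chart, would be left untouched.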

Each of these formats will be replaced with a call to BillboardURLbyName, using artist names found in BillboardID and chart names from BillboardChartNum.

It's not quite accurate to describe that section of the URL as "artist name, urlencoded". Billboard is loose about what it considers to be the artist name, and the old site was even looser. To match the artist name, the bot extracts the name from the existing URL, blanks all punctuation marks and removes all sequences of multiple spaces. It does the same to the list of artist names stored in the source files of BillboardID and chooses the matching entry. This renormalizes every name to the one actually used by the site.
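A minimal sketch of that matching step, again in Python rather than the bot's PHP. It reads "removes all sequences of multiple spaces" as collapsing each run to a single space, and it also folds case; both are assumptions, as are the function names and the sample artist names. Only ASCII punctuation is handled.

```python
import re
import string

# Map every ASCII punctuation character to a space ("blanking" it).
_PUNCT_TO_SPACE = str.maketrans({c: " " for c in string.punctuation})


def normalize(name):
    """Blank punctuation, collapse runs of whitespace, trim, and lower-case."""
    blanked = name.translate(_PUNCT_TO_SPACE)
    return re.sub(r"\s+", " ", blanked).strip().lower()


def match_artist(url_name, known_artists):
    """Return the stored artist name whose normalized form equals that of url_name."""
    index = {normalize(artist): artist for artist in known_artists}
    return index.get(normalize(url_name))
```

With a stored list of `["The Example Band", "Other Act"]`, the URL fragment `the-example--band` resolves to `The Example Band`, while a name that is not in the list yields `None`.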

The bot will only make the replacement if it can validate both the extracted artist name and the extracted chart number. This serves as a final safety check on the parsing logic: if any false matches to the formats slip through, it is highly unlikely that the false match on the format will provide correct values for both the chart number and the artist.
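The "both must validate" rule can be sketched as a single gate, shown here in Python as an illustration only. The two lookup tables stand in for the data behind BillboardID and BillboardChartNum; their shape, the function name and the sample values are hypothetical.

```python
def validated_pair(artist_key, chart_number, artists_by_key, charts_by_number):
    """Return (artist, chart) only when both lookups succeed, otherwise None."""
    artist = artists_by_key.get(artist_key)
    chart = charts_by_number.get(chart_number)
    if artist is None or chart is None:
        # Either half failed to validate: leave the original link alone.
        return None
    return artist, chart
```

A false match on the URL format would have to produce a known artist and a known chart number at the same time to get through this check.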

I have run the bot in a read-only "dry run" mode, reading from Wikipedia and writing the new article text to my local hard drive. It found over 90,000 links to Billboard.com. Of those, it determined that it could repair 11,829 in 4,428 articles. Many of the remainder still work (Billboard did not move news articles or links to specific weeks on specific charts). Links to specific albums or singles do not, but those will need to be handled as a separate task, which will require recrawling Billboard's site and building a database of song and album links.

The first run should repair all of the old links. I would like to run it periodically to reformat hardcoded URLs as templates, so that this problem doesn't recur on this scale.

Discussion
BAG assistance needed: I will have time Tuesday to finish this, so I'd like to get approval to do so.&mdash;Kww(talk) 12:39, 1 March 2013 (UTC)
 *  ·Add§hore·  Talk To Me! 12:58, 1 March 2013 (UTC)

The initial run is complete: I did my 50 edits in a few passes as I found a few small bugs. I'm ready to go, and have asked for review at a few related projects.&mdash;Kww(talk) 19:43, 6 March 2013 (UTC)

Is this just too boring of a topic for anyone to want to discuss?&mdash;Kww(talk) 21:05, 9 March 2013 (UTC)


 * Boring topics are a good thing.  MBisanz  talk 14:48, 11 March 2013 (UTC)


 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.