Wikipedia:Bots/Requests for approval/LkolblyBot


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was

LkolblyBot
Operator:

Time filed: 18:57, Monday, December 24, 2018 (UTC)

Function overview: This bot automatically updates Alexa rankings in website infoboxes by querying the Alexa Web Information Service.

Automatic, Supervised, or Manual: Automatic

Programming language(s): Python

Source code available: https://github.com/lkolbly/alexawikibot (presently, the actual saving is commented out, for testing)

Links to relevant discussions (where appropriate): Previous bot that performed this task: OKBot_5

Edit period(s): Monthly or so

Estimated number of pages affected: 4,560 articles are in the current candidate list. A subset of these pages will be updated each month. Other pages could be pulled in over time if someone adds Alexa information to a page. Also, there will be a whitelist, copied from User:OsamaK/AlexaBot.js, of pages that will be edited (presently containing 1,412 pages).

Namespace(s): Articles

Exclusion compliant (Yes/No): Yes (via whatever functionality is already in pywikipedia)

Function details: This bot will scan all pages (using a database dump as a first pass) to find pages which have the "Infobox website" template with both "url" and "alexa" fields.

It will parse the domain from the url field using a few heuristics. Domains that have subdomains return incorrect results from AWIS (e.g. mathmatica.wolfram.com returns the result for just wolfram.com), so these domains are discarded (and the page is not touched). For the remaining domains, it will perform an AWIS query to determine the current website rank and the trend over 3 months.
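The domain step could look roughly like the following. This is a minimal sketch, not the bot's actual code: the function name `domain_for_lookup` is hypothetical, and the "exactly one dot" test is a deliberately naive stand-in for the heuristics mentioned above (it would wrongly reject hosts such as example.co.uk). It also assumes the url field holds a bare URL rather than wiki markup.

```python
from urllib.parse import urlparse


def domain_for_lookup(url_field):
    """Return a bare domain to query, or None if the page should be skipped.

    Hypothetical helper: hosts with a subdomain are rejected because the
    ranking service reportedly answers for the parent domain instead.
    """
    text = url_field.strip()
    if not text:
        return None
    # urlparse only fills in the host when a scheme or a leading // is present.
    if "://" not in text and not text.startswith("//"):
        text = "//" + text
    host = urlparse(text).hostname
    if not host:
        return None
    if host.startswith("www."):
        host = host[4:]
    # Naive assumption: anything other than name.tld counts as a subdomain.
    if host.count(".") != 1:
        return None
    return host
```

A caller would leave the page untouched whenever this returns None.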

Websites will be classified as increasing, steady, or decreasing in popularity, each shown with the corresponding indicator. A site increasing in popularity will get the increase indicator, even though its rank is numerically decreasing (previously, many sites were also classified into IncreaseNegative and DecreasePositive, which I didn't understand).
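The direction convention can be stated in a few lines. This is only a sketch: the function name, its two arguments, and the label strings are assumptions made for illustration, not what AWIS returns or what the bot writes to the page.

```python
def classify_trend(rank_now, rank_before):
    """Map a change in rank to a popularity trend label.

    A smaller rank number means a more popular site, so a numerically
    falling rank counts as an increase in popularity.
    """
    if rank_now < rank_before:
        return "increase"
    if rank_now > rank_before:
        return "decrease"
    return "steady"
```

For example, a site that moves from rank 500 to rank 100 is labelled as increasing.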

Then, in the text of the article, whatever the current alexa data is will be replaced by something like:

(e.g. 169,386  )

There are two as-yet untested test cases that I'll test (and fix if necessary) before any full-scale deployment:
 * Apparently some infoboxes have multiple alexa parameters? I have to go find one and see what the bot does with it. (Probably the right thing to do is to not touch the page at all in that situation.)
 * Some pages have an empty alexa parameter, which should be fine, but it's worth testing anyway.
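Both cases could be detected before any edit is attempted. The sketch below assumes the parameter in question is `alexa` and that the infobox wikitext has already been isolated. The helper name and its return labels are hypothetical, and the regex is a simplification that would miscount an `alexa` parameter belonging to a nested template.

```python
import re

# Matches "| alexa =" and captures the text up to the next pipe or closing brace.
ALEXA_PARAM = re.compile(r"\|\s*alexa\s*=([^|}]*)")


def alexa_param_status(infobox_text):
    """Classify an infobox as 'missing', 'empty', 'filled' or 'multiple'."""
    values = ALEXA_PARAM.findall(infobox_text)
    if not values:
        return "missing"
    if len(values) > 1:
        # Ambiguous: safest to leave the page alone.
        return "multiple"
    return "filled" if values[0].strip() else "empty"
```

Pages reported as "multiple" would simply be skipped.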

Discussion
Please make the bot's talk page.

"whatever the current alexa data is will be replaced" - how do you know there isn't more than just the previous value? Or that there isn't a reference that is used elsewhere?

I imagine many pages that copy-paste the template code will have an empty alexa parameter. This would not be any different to not having it at all.

Do you preserve template's formatting?

The particular citation style the bot uses may not match the article's, especially the date format. (I wonder why we don't have an Alexa citation template still.) — HELL KNOWZ   ▎TALK 21:26, 24 December 2018 (UTC)


 * The format of the template code overall is preserved, the value is replaced by replacing the regex, so the rest of the template is unaffected. (the number of spaces before the equal sign goes from "any number" to "exactly one", though)


 * Yeah, I was debating having it skip empty alexa parameters. There's value in adding it (as much as updating it), though for very small sites the increase/decrease indicator may not be particularly useful.


 * I didn't think to check whether there's more than the previous value, though I can't think of what else would be there. There are at least two common formats for this data: the OKBot format, and a similar format with a parenthesized month/year instead of the asof (see https://en.wikipedia.org/wiki/Ethnologue - note the lack of a reference). I guess it would be safest to check that the value is in a whitelisted set of Alexa formats before replacing it. I'll bet a small number of regexes could cover 90% of cases (and the remaining 10% could be changed to a conforming case by hand :D)


 * The reference is interesting, because it's basically a lie. It's a link to the alexa page, but that isn't where the data was actually retrieved from, it was retrieved from their backend API. As for if someone's already using that reference, it shouldn't be too hard to check for that, I would think. I imagine (with only anecdotal evidence) that most of those cases will be phrases like "as of 2010, foobar.com had an alexa rank of 5". Updating that reference to the present value may not make sense in the context of the article (myspace isn't as big as it used to be, an article talking about how big it was in 2008 won't care how big it is now). But either way they should probably be citing a source that doesn't automatically change as soon as you read it.


 * The ethnologue page already looks like it has diverging date formats? I don't know how common that is, I'll have to go dig up the style guide for citations (maybe we should have a bot to make that more uniform). What would it take to make a template? (also, would that solve the uniformity issue? I guess at least it'd be uniform across all alexa rankings)


 * Lkolbly (talk) 14:52, 25 December 2018 (UTC)
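The whitelist idea from the replies above could be sketched as follows. Everything here is illustrative: the patterns are guesses at plausible value formats rather than the bot's real regexes, `replace_alexa` is a hypothetical name, and values carrying a reference are deliberately not matched, so such pages would be skipped. Unlike the behaviour described above, this version splices the new value in place, so the spacing around the equals sign is left as it was.

```python
import re

# Hypothetical whitelist: an optional trend template, a rank, and an optional
# parenthesised "Month YYYY" or {{as of|...}} date.
KNOWN_VALUE = re.compile(
    r"^(\{\{(?:increase|decrease|steady)\}\}\s*)?"
    r"[\d,]+\s*"
    r"(\((?:\{\{as of\|[^{}]*\}\}|[A-Za-z]+ \d{4})\))?$",
    re.IGNORECASE,
)
# Group 1 is the "| alexa =" prefix, group 2 the rest of that line.
PARAM = re.compile(r"(\|\s*alexa\s*=[ \t]*)([^\n]*)")


def replace_alexa(wikitext, new_value):
    """Return wikitext with the alexa value swapped, or None to skip the page."""
    m = PARAM.search(wikitext)
    if m is None or not KNOWN_VALUE.match(m.group(2).strip()):
        return None
    # Splice only the value; everything else is copied through unchanged.
    return wikitext[:m.start(2)] + new_value + wikitext[m.end(2):]
```

Returning None for anything unrecognised keeps the unusual cases for a human to fix by hand.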


 * WP:CITEVAR and WP:DATEVAR are the relevant stuff on date and citation differences. On English wiki, changing or deviating from citation or date style without a good reason is very controversial. The short answer is "don't". Bots are generally expected to do the same, although minor deviations are somewhat tolerated. But bots are expected to follow templates, like use dmy dates or df parameters. — HELL KNOWZ   ▎TALK 16:36, 25 December 2018 (UTC)


 * Okay, it looks like it should be pretty straightforward to just check for the two tags (use dmy dates and use mdy dates) and set the df parameter. Lkolbly (talk) 14:39, 26 December 2018 (UTC)


 * Updated the bot so that it follows mdy/dmy dates, updating the accessed date and asof accordingly. Also constrained the pages that will be updated to a handful of matching regexes and also pulled a list from User:OsamaK/AlexaBot.js, which eventually I'll make a copy of. Lkolbly (talk) 18:20, 1 January 2019 (UTC)
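Following the page's date tags might look like this. It is a sketch under assumptions: the function name is hypothetical, the tags are matched by a simple case-insensitive search for "use dmy dates" and "use mdy dates", and ISO format is used when neither tag is present.

```python
import re
from datetime import date

DMY_TAG = re.compile(r"\{\{\s*use dmy dates\b", re.IGNORECASE)
MDY_TAG = re.compile(r"\{\{\s*use mdy dates\b", re.IGNORECASE)

# Spelled out so the result does not depend on the system locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def format_access_date(page_text, when):
    """Format a date to match the page's declared style, defaulting to ISO."""
    month = MONTHS[when.month - 1]
    if DMY_TAG.search(page_text):
        return f"{when.day} {month} {when.year}"
    if MDY_TAG.search(page_text):
        return f"{month} {when.day}, {when.year}"
    # isoformat() zero-pads the month and day.
    return when.isoformat()
```

The same function could serve both the access date and the as-of date, so the two always agree on a page.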


 * Primefac (talk) 00:43, 20 January 2019 (UTC)


 * Ran bot to edit 50 randomly selected pages. So far I've noticed two bugs that cropped up, one involving leading zeros in the access dates and another where the comment "Updated by LKolblyBot" got repeated. I'm going to go through and fix the issues by hand for now and apply fixes to the bot. Lkolbly (talk) 20:20, 27 January 2019 (UTC)


 * Also, looking closer, some pages got a "Retrieved" date format that doesn't match the rest of the page (e.g. Iraqi News), but I'm pretty sure it's because those pages aren't annotated with dmy or mdy. Lkolbly (talk) 20:47, 27 January 2019 (UTC)
 * I have questions.
 * First, Special:Diff/880480890 - is there a reason it chooses http over https?
 * Second, why do some diffs use ISO formatting for the date while others actually change it to dmy?
 * Third, are OKBot and Acagastya still updating these pages, and would it make sense to remove those names from the comments?
 * My fourth/fifth questions were going to be what you were going to do about duplicate names, but it looks like you noticed that and are taking care of it, along with a lack of leading zeros issue with the 2019-01-27.
 * Also, as a minor point, even if you've only done 44 edits with the bot, please make sure when you finish with a trial that you link to the specific edits, since while "Contribs" might only show those 44 edits now, after you've made thousands they won't be the first thing to look at.
 * Actually, I do have another thought - for brevity, it might be best to have a wikilink in the edit summary instead of a full URL. Primefac (talk) 20:12, 28 January 2019 (UTC)


 * I have answers.
 * There's no particular reason it uses http over https for the alexa.com link, I hadn't given it a second thought. I can change it to https.
 * The variations in date formatting are an attempt to stick with the article's predominant style. The default style is ISO format, and if there's a use dmy or use mdy tag, it uses the respective format.
 * OKBot appears defunct. I wasn't aware of Acagastya, though from their user page it looks like they've left English Wikipedia at least. It does make sense to remove the (now duplicate) comments; that was ultimately the goal, but it didn't work as planned.
 * Good point on the making a list of the trial edits, conveniently it looks like I can search the contribs to make a view of just the trial edits.
 * Yeah, the wikilink idea occurred to me a few minutes too late, it looks terrible in the commit message :/ Lkolbly (talk) 23:32, 28 January 2019 (UTC)
 * With the constant modification that Alexa goes through, it is not a good idea to put manual labour for updating the ranks. acagastya  08:53, 29 January 2019 (UTC)


 * Regarding the 'Updated monthly by ...' lines - as is being demonstrated here there are stale entries - and it can be expected as no bot should ever be expected to operate in the future. To that end I don't think this should be added, and would support having the next update remove any existing such comment codes. —  xaosflux  Talk 15:21, 7 February 2019 (UTC)
 * Please implement the above changes in this run. Primefac (talk) 21:30, 14 February 2019 (UTC)
 * was the trial completed? What are the results (please link to diffs as well). — xaosflux  Talk 18:51, 12 March 2019 (UTC)


 * Sorry I've been dead in the water this last month, time hasn't been on my side (I figured I'd re-architect my server before I ran the trial, and have everything nice and containerized, but that didn't work out and then one thing led to another). I haven't done the trial yet, I plan to run it this coming weekend though. Lkolbly (talk) 19:10, 12 March 2019 (UTC)


 * Okay, ran the bot on these 50 pages. Some notes:
 * r.e. the "Updated by" comments: So it turns out the framework I'm using (pywikibot) strips out the comments, which is why they were being duplicated. This run did not add "updated by" comments. Removing existing comments could be done but would have to be a separate script.
 * I think I'll change the edit summary to "Bot: Update Alexa ranking (link to a list of sites that the bot maintains)"
 * Some sites (e.g. Gothamist) list a URL in the infobox that is not ostensibly the site's actual (or main) URL, which gives an inaccurate alexa ranking. I think this is beyond my control though.
 * The original formatting of the infobox is unfortunately lost in pywikibot. The spacing varies - some (Adventure Gamers) use no spaces after the vertical bar, most one space, some align the equals signs, some don't (or do so inconsistently). Regardless, the information is gone at rewrite time.
 * A large number of sites had an "April 2014" style alt text specified for the "as of" tag. This script eliminates those.
 * One page (Shutterfly) had the "alexa" ref specified in a separate infobox references section at the bottom of the infobox, which led to a duplicate reference name error.
 * Otherwise, everything seemed to run fairly smoothly. The last point I might be able to handle by searching for  or something in the page text. I think it's a fairly rare occurrence though.
 * Lkolbly (talk) 02:13, 20 March 2019 (UTC)
 * regarding the increase vs increasenegative difference, my reading is that this is a Numerical/Desirable field (see Template:Infobox website and related talk pages). Moving a Ranking from 5th to 1st is a "decrease" in value, but an increase in desirability.  Arguably "1st place" is an increase from "2nd place" though so normal increase could be fine here - this should be sorted out at Template talk:Infobox website, and that template documentation should be updated to match before this begins.  You don't want your bot to be warring with human editors over the direction of a triangle. —  xaosflux  Talk 13:59, 28 March 2019 (UTC)
 * please let us know any results from: Template_talk:Infobox_website. — xaosflux  Talk 14:05, 2 April 2019 (UTC)


 * See also Village_pump_(policy) asking about the utility of this data at all. — xaosflux  Talk 14:59, 2 April 2019 (UTC)
 * task denied. The discussion at VPP in Special:PermaLink/894265275 is indicating that these edits are not widely supported.  Should this change through future discussions feel free to open a new BRFA referencing this one, and linking to the supportive discussion. Best regards, —  xaosflux  Talk 23:58, 26 April 2019 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.