Wikipedia:Bots/Requests for approval/KarlsenBot 2


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Withdrawn by operator.

KarlsenBot 2
Operator: Peter Karlsen

Automatic or Manually assisted: Automatic

Programming language(s): Python

Source code available: Standard pywikipedia, invoked as python replace.py "-links:User:Peter Karlsen/amazonlinks" -regex -always -namespace:0 -putthrottle:10 "-summary:removing redundant parameters from link[s], task 2" "-excepttext:http\:\/\/www\.amazon\.com[^ \n\]\[\}\{\|\<\>\.\,\;\:\(\)\"\']*\&node\=\d*" "(http\:\/\/www\.amazon\.com[^ \?\n\]\[\}\{\|\<\>\.\,\;\:\(\)\"\']*)\?[^ \?\n\]\[\}\{\|\<\>\.\,\;\:\(\)\"\']*" "\\1"

Function overview: removes associate codes from amazon.com urls

Links to relevant discussions (where appropriate): Bot_requests

Edit period(s): Continuous

Estimated number of pages affected: 20,000

Exclusion compliant (Y/N): yes, native to pywikipedia

Already has a bot flag (Y/N): yes

Function details: External links to amazon.com are sometimes legitimate. However, the acceptance of associate codes in such links encourages unnecessary proliferation of links by associates who receive payment every time a book is purchased as a result of a visit to the website using the link. Per the bot request, the portion of every amazon.com link including and after the question mark will be removed, thereby excising the associate code while maintaining the functionality of the link.
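The substitution performed by the invocation above can be sketched in plain Python. This is a simplified rendering for illustration only (the character classes are reproduced from the command line, but the surrounding pywikipedia machinery, page exclusions, and throttling are omitted):

```python
import re

# Simplified form of the replace.py substitution: capture the amazon.com
# URL up to the first "?", then drop the "?" and everything after it.
AMAZON_URL = re.compile(
    r"(http://www\.amazon\.com[^ ?\n\]\[}{|<>.,;:()\"']*)"  # base URL
    r"\?[^ ?\n\]\[}{|<>.,;:()\"']*"                         # query string
)

def strip_query(wikitext):
    """Remove the query string (including any associate code) from amazon.com links."""
    return AMAZON_URL.sub(r"\1", wikitext)
```

Note that the real run additionally skips, via -excepttext, any page whose amazon.com links carry a &node= parameter.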

Discussion
Why not grab the link rel="canonical" tag from the amazon.com page header? It has a clean URL which can then be re-used, and if grabbed programmatically, there's no chance of accidentally latching onto someone's affiliate account; the affiliate system is cookie based. —Neuro (talk) 11:34, 8 October 2010 (UTC)
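For illustration, the canonical-URL approach suggested above could be sketched as follows. This is a hypothetical sketch, not part of the bot as proposed; a real run would fetch each amazon.com page (e.g. with urllib.request) before parsing, whereas a static snippet stands in here:

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collects the href of a <link rel="canonical"> tag from a page header."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "link" and attr.get("rel") == "canonical":
            self.canonical = attr.get("href")

# A static snippet stands in for a fetched amazon.com page here.
sample = ('<html><head><link rel="canonical" '
          'href="http://www.amazon.com/dp/0143034901" /></head></html>')
finder = CanonicalFinder()
finder.feed(sample)
```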

Are there no examples of Amazon urls with parameters that are needed to reach the correct target? What if it is a FAQ or some Help page that the external link points to? And I agree with the above, that it would be better to replace the url with the site's specified canonical url (e.g.  for this page) — HELL KNOWZ  ▎TALK 12:02, 8 October 2010 (UTC)
 * Amazon urls for which the parameters after the question mark actually are relevant (those beginning with http://www.amazon.com/gp/help/ ) can simply be excluded from the bot run. There are substantial benefits, both in design simplicity and in the ability to complete the task, to modifying the existing urls: if the bot attempts to visit amazon.com 20,000 times to retrieve canonical urls for each link with an associate code, there's a significant possibility that my IP address will be blocked from further access to amazon for an unauthorized attempt to spider their website. Peter Karlsen (talk) 16:40, 8 October 2010 (UTC)
 * Then you will have to produce a list of exclusions or a list of inclusions of urls, so that no problems arise, as these edits are very subtle and errors may go unnoticed for a very long time. — HELL KNOWZ  ▎TALK 17:16, 8 October 2010 (UTC)
 * Exclusion of all urls beginning with http://www.amazon.com/gp should be sufficient. Peter Karlsen (talk) 17:52, 8 October 2010 (UTC)
 * What about (no params)? Sure, it's unlikely to be in a reference, but that's just an example. —  HELL KNOWZ  ▎TALK 10:09, 9 October 2010 (UTC)
 * Yes, it doesn't appear that exclusion of http://www.amazon.com/gp would be sufficient. However, preserving the &node=nnnnnn substring of the parameters should avoid breaking any links (for instance, http://www.amazon.com/kitchen-dining-small-appliances-cookware/b/ref=sa_menu_ki6?&node=284507 produces the same result as the original http://www.amazon.com/kitchen-dining-small-appliances-cookware/b/ref=sa_menu_ki6?ie=UTF8&node=284507, http://www.amazon.com/gp/help/customer/display.html?&nodeId=508510 is the same as http://www.amazon.com/gp/help/customer/display.html?ie=UTF8&nodeId=508510, etc.) Peter Karlsen (talk) 17:06, 9 October 2010 (UTC)
 * And are you certain there are no other cases (other parameters besides "node")? I picked this one at random. — HELL KNOWZ  ▎TALK 17:40, 9 October 2010 (UTC)
 * Yes. As another example, http://www.amazon.com/b/ref=sv_pc_5?&node=2248325011 links to the same page as the original http://www.amazon.com/b/ref=sv_pc_5?ie=UTF8&node=2248325011. None of these sorts of pages would normally constitute acceptable references or external links at all; when amazon.com links are used as sources, they normally point to pages for individual books or other media, which have no significant parameters. However, just in case, links to amazon help and product directory pages are now covered. Peter Karlsen (talk) 17:53, 9 October 2010 (UTC)
 * Nevertheless, an automated process cannot determine unlisted (i.e. blacklist/whitelist) link suitability in the article, no matter where the link points to. Even if a link is completely unsuitable for an article, a bot should not break it; it is a human editor's job to remove or keep the link. — HELL KNOWZ  ▎TALK 17:59, 9 October 2010 (UTC)
 * Some bots, such as XLinkBot, purport to determine whether links are suitable. However, since this task is not intended for that purpose, it has been modified to preserve the functionality of all amazon.com links. Peter Karlsen (talk) 18:06, 9 October 2010 (UTC)
 * XLinkBot works with a blacklist; I am referring to "unlisted (i.e. blacklist/whitelist) link[s]", i.e., links you did not account for. In any case, the error margin should prove very small, so I have no actual objections. — HELL KNOWZ  ▎TALK 18:09, 9 October 2010 (UTC)
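The &node= safeguard discussed in the thread above corresponds to the -excepttext pattern in the invocation. A simplified Python check of the same idea (the pattern is taken from the command line; the rest is illustrative):

```python
import re

# A page is skipped wholesale if any of its amazon.com links carries a
# "&node=" parameter, since that parameter is needed to reach the target.
NODE_LINK = re.compile(
    r"http://www\.amazon\.com[^ \n\]\[}{|<>.,;:()\"']*&node=\d*"
)

def page_is_skipped(wikitext):
    """True when the page contains an amazon.com link with a &node= parameter."""
    return NODE_LINK.search(wikitext) is not None
```

Note the pattern matches &node= specifically, so it would not catch the &nodeId= help-page links mentioned above; those rely on the /gp exclusion instead.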


 * This bot seems highly desirable, although I don't know if something more customized for the job than replace.py would be desirable. Has this been discussed with the people who frequent WT:WPSPAM (they may have some suggestions)? How would the bot know which pages to work on? I see "-links" above, although my pywikipedia (last updated a month ago, and not used on Wikipedia) does not mention -links in connection with replace.py (I suppose it might be a generic argument?). As I understand it, -links would operate on pages listed at User:Peter Karlsen/amazonlinks: how would you populate that page (I wonder why it was deleted)? I suppose some API equivalent to Special:LinkSearch/*.amazon.com is available – it would be interesting to analyze that list and see how many appear to have associate referral links. It might be worth trying the regex on a good sample from that list and manually deciding whether the changes look good (i.e. without editing Wikipedia). I noticed the statement ".astore.amazon.com is for amazon affiliate shops" here, although there are now only a handful of links to that site (LinkSearch). Johnuniq (talk) 01:54, 9 October 2010 (UTC)
 * User:Peter Karlsen/amazonlinks is populated from Special:LinkSearch/*.amazon.com via cut and paste; the bot then relinks the Wikipedia pages in which the external links appear. The -links parameter is described in the replace.py reference. I will post a link to this BRFA at WT:WPSPAM. Peter Karlsen (talk) 17:06, 9 October 2010 (UTC)
 * Also, I've performed limited, successful testing of a previous regex to remove associate codes on my main account, with every edit manually confirmed (similar to the way that AWB is normally used). Peter Karlsen (talk) 17:28, 9 October 2010 (UTC)
 * Strictly speaking, you are not "Remov[ing] associate code from link[s]"; you are "removing redundant parameters from link[s]", as the majority of those edits do not involve any associate parameters. I always strongly suggest having a descriptive summary for (semi)automated tasks, preferably with a link to a page with further info. — HELL KNOWZ  ▎TALK 17:40, 9 October 2010 (UTC)
 * I can rewrite the edit summary as described, with a link to this BRFA. Peter Karlsen (talk) 17:56, 9 October 2010 (UTC)


 *  MBisanz  talk 22:55, 19 October 2010 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.