Wikipedia:Bots/Requests for approval/SQLBot-AmazonAffiliateRemoval


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was

SQLBot-AmazonAffiliateRemoval
Operator:

Time filed: 01:11, Tuesday, December 11, 2018 (UTC)

Function overview: Per Spam, removes the affiliate (tag=) portion from Amazon links.

Automatic, Supervised, or Manual: Automatic

Programming language(s): PHP

Source code available: Once complete, yes.

Edit period(s): Daily

Estimated number of pages affected: 673 in the first run

Namespace(s): Mainspace

Exclusion compliant (Yes/No): No

Function details: Looks for tag= in Amazon URLs using this query:

to find the affected pages and URLs. Once found, removes the ?tag= parameter from every Amazon URL found as above.

The function used to strip the tag= portion of the url is:
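The actual stripping function is not included in this archive. As an illustration only (not the bot's actual code), a minimal PHP sketch of such a function might parse the URL, drop the `tag` parameter from the query string, and reassemble the rest:

```php
<?php
// Hypothetical sketch, not the bot's actual implementation:
// removes the affiliate "tag" parameter and rebuilds the URL.
function strip_affiliate_tag( string $url ): string {
    $parts = parse_url( $url );
    if ( !isset( $parts['query'] ) ) {
        return $url; // no query string, nothing to strip
    }
    parse_str( $parts['query'], $params );
    unset( $params['tag'] );
    $query = http_build_query( $params );

    // Reassemble the URL from its components.
    $result = $parts['scheme'] . '://' . $parts['host'] . ( $parts['path'] ?? '' );
    if ( $query !== '' ) {
        $result .= '?' . $query;
    }
    if ( isset( $parts['fragment'] ) ) {
        $result .= '#' . $parts['fragment'];
    }
    return $result;
}
```

Note that rebuilding via http_build_query can re-encode other query parameters, which is one source of the URL-encoding issues mentioned below.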

Discussion

 * — xaosflux  Talk 01:28, 11 December 2018 (UTC)


 * I note that SQL query is not very efficient. There's no way to really avoid the table scan, but you should be able to greatly improve subsequent runs by remembering the maximum  value from just before the current run and only looking at rows with higher values next time. Anomie⚔ 18:28, 11 December 2018 (UTC)
 * Also you might consider batching the query: select with  or something like that, process whichever rows you got from that, and repeat with increasing X until it's greater than the current  . Ideally adjust the "10000" there so each batch doesn't take more than a second or two to return. Anomie⚔ 18:36, 11 December 2018 (UTC)
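The batching approach Anomie describes could be sketched in PHP roughly as follows. This is an assumption-laden illustration: it presumes a PDO connection to a replica with the MediaWiki externallinks table, and uses the el_id/el_from/el_to column names from the 2018-era schema.

```php
<?php
// Hypothetical sketch of the suggested batching loop: scan externallinks
// in fixed-size el_id ranges so no single query runs too long.
function process_in_batches( PDO $db, callable $handle, int $batchSize = 100000 ): void {
    $maxId = (int)$db->query( 'SELECT MAX(el_id) FROM externallinks' )->fetchColumn();

    for ( $start = 0; $start <= $maxId; $start += $batchSize ) {
        $stmt = $db->prepare(
            'SELECT el_from, el_to FROM externallinks
             WHERE el_id BETWEEN ? AND ? AND el_to LIKE ?'
        );
        $stmt->execute( [ $start, $start + $batchSize - 1, '%tag=%' ] );
        foreach ( $stmt as $row ) {
            $handle( $row ); // process each candidate URL
        }
    }
}
```

Tuning $batchSize controls how long each individual query runs, per the "a second or two" guidance above.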


 * I ended up implementing batching as suggested above. 100000 ids at a time, each query takes between 1 and 2 seconds to run. In the beginning, there were some issues with url encoding, but those have been resolved. SQL Query me!  23:09, 11 December 2018 (UTC)


 * Q: Is 673 pages the estimated total that have a tag, or just a small trial set? Asking because if it's a large number, it should be aware of archive URLs. Changing an archive URL will break the URL. We have many millions of archive URLs. There are a couple fairly simple ways to avoid archive URLs I can pass along if you would like. -- Green  C  23:42, 11 December 2018 (UTC)
 * , That should be everything, and that's a very good catch. I could check the "host" portion of parse_url to make sure it contains either "amazon" or "amzn". I believe that would be sufficient.
 * By the way - I imagine that we could probably reduce this to a one-time run as well, if someone wanted to make an edit filter or spam blacklist entry to stop these before they get started. SQL Query me!  23:50, 11 December 2018 (UTC)
 * Not sure how it's parsing the article for URLs; might it still pick up URLs in the query or path portion of another URL? The two main types lead with either a "/" or "?url=" like  or  . If it back-checked for those leading character(s) it should be safe. User:Headbomb came up with a regex for this I can try to track down if you'd like. --  Green  C  00:08, 12 December 2018 (UTC)


 * Don't really know what I can do to help here with regexes. However, I'll comment on blocking affiliates with an edit filter. Most often, those are just good faith copy-pastes of URLs. That shouldn't really be blocked, although an edit summary tag might be appropriate. Headbomb {t · c · p · b} 01:35, 12 December 2018 (UTC)
 * I've updated it to require that the hostname returned by parse_url contains either "amazon.", or "amzn." - which will handle any issues surrounding archives. SQL Query me!  04:25, 13 December 2018 (UTC)
 * User:SQL: question: If the Wikisource contains  a regex for an amazon URL would match on   portion, and parse_url would OK it since it has an amazon hostname. --  Green  C  04:41, 13 December 2018 (UTC)
 * , If you aren't familiar with the PHP's parse_url, you can play with it here.
 * The results I get from your specific example are:

array ( 'scheme' => 'http', 'host' => 'archive.org', 'path' => '/web/20181210010101/http://amazon.com/', )
 * Of which, 'archive.org' would return false on a strpos( "archive.org", "amazon." ) call, skipping that URL. SQL Query me!  04:46, 13 December 2018 (UTC)
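The host check described above can be sketched as follows (an illustration of the approach, not the bot's actual code; the helper name is hypothetical):

```php
<?php
// Sketch of the host check: only URLs whose hostname contains
// "amazon." or "amzn." are treated as Amazon links, so archive
// URLs like archive.org/web/... are skipped.
function is_amazon_host( string $url ): bool {
    $host = parse_url( $url, PHP_URL_HOST );
    if ( $host === null || $host === false ) {
        return false; // missing or malformed host
    }
    return strpos( $host, 'amazon.' ) !== false
        || strpos( $host, 'amzn.' ) !== false;
}

// The archive.org example from the discussion parses to host
// "archive.org", so it is skipped; a direct Amazon link is not.
var_dump( is_amazon_host( 'http://archive.org/web/20181210010101/http://amazon.com/' ) ); // bool(false)
var_dump( is_amazon_host( 'https://www.amazon.com/dp/B000123?tag=foo-20' ) );             // bool(true)
```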
 * I understand parse_url takes a URL to be parsed; my question was how the URL is retrieved from the wikisource in the first place, which I assumed was by regex. But looking again, it is not parsing the wikisource, rather pulling from an SQL query. Then it modifies the URL, and presumably does a search/replace in the wikisource. In which case my initial concern is answered, there is no problem :) --  Green  C  05:06, 13 December 2018 (UTC)
 * , Yep, sorry - I didn't realize that's where we hit a loop. It gets the incoming url directly from the Externallinks table. SQL Query me!  05:12, 13 December 2018 (UTC)
 * Yep my misunderstanding. FYI the Externallinks table is not complete. I've done tests before and found quite a few missing URLs. The reason is they don't parse URLs contained in some templates and other reasons. Probably for this application it's OK. Could possibly supplement with a CirrusSearch afterwards to check any were missed. -- Green  C  05:36, 13 December 2018 (UTC)


 * As this appears to be a single-purpose account, please build out the user page a bit more to explain why this bot is doing what it does, so that people looking at it will have an understanding that Wikipedia isn't in some us-vs-Amazon fight. —  xaosflux  Talk 03:16, 12 December 2018 (UTC)
 * ,, but while I'm doing so - I thought it might be appropriate to rename the bot as well, removing the 'amazon' bit. Both for the reason you point out above, and that I might use it later for other affiliate link removals. What do you think? SQL Query me!  04:16, 13 December 2018 (UTC)
 * User:SQLBot-AffiliateRemoval for example sounds fine? — xaosflux  Talk 12:33, 13 December 2018 (UTC)


 * Edits completed by hand, will probably monitor with an AbuseFilter later on.  SQL Query me!  19:33, 16 December 2018 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.