Wikipedia:Bots/Requests for approval/RBSpamAnalyzerBot


 * The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Approved.

RBSpamAnalyzerBot
Operator: ReyBrujo

Automatic or Manually Assisted: Automatic unsupervised

Programming Language(s): Bash

Function Summary: Upload statistics about external links and flag likely spambot-created pages.

Edit period(s) (e.g. Continuous, daily, one time run): Once per database dump (approx. every 45-60 days)

Edit rate requested: around 20 edits per minute

Already has a bot flag (Y/N):

Function Details: Assuming the bot continues editing User:ReyBrujo/Dumps:
 * 1) Archive the previous dump at User:ReyBrujo/Dumps/Archive by adding a link to it there.
 * 2) Modify User:ReyBrujo/Dumps to transclude the page of the new dump analysis, adding more interwikis if other Wikipedias have been processed.
 * 3) Create the page for the new dump analysis (using the format User:ReyBrujo/Dumps/yyyymmdd). The created page will hold the lists of articles with the most external links, the most-linked sites, the number of pages and external links in the dump, the number of external links found in the main namespace, the number of non-redirect pages in the main namespace, and two sets of lists: Very probable spambot pages (pages whose titles contain /wiki/ or /w/) and Probable spambot pages (pages whose titles end in /). For now, these lists must be reviewed manually to determine whether the pages were created by spambots.
 * 4) Create the pages containing the lists, with titles like Articles with more than xxx external links, Articles with between xxx and yyy external links, Sites linked more than xxx times and Sites linked between xxx and yyy times, depending on the content.
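The title heuristics described in step 3 could be sketched in Bash along the following lines. This is a hypothetical sketch based only on the heuristics stated above; the function name and output labels are illustrative and are not the bot's actual code:

```shell
#!/bin/sh
# Classify a page title using the heuristics from the function details:
# titles containing /wiki/ or /w/ are "very probable" spambot pages,
# titles ending in a slash are "probable" spambot pages.
classify_title() {
  title="$1"
  case "$title" in
    */wiki/*|*/w/*) echo "very-probable" ;;  # contains /wiki/ or /w/
    */)             echo "probable" ;;       # ends in a slash
    *)              echo "clean" ;;          # neither pattern matched
  esac
}
```

A run such as `classify_title 'Example/wiki/Cheap pills'` would print `very-probable`, while an ordinary article title would print `clean`; the resulting lists would still need the manual review mentioned above.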

To get the statistics, the bot will download the latest database dump from download.wikimedia.org and process it locally.
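For the "most linked sites" statistics, the local processing could reduce to a pipeline like this minimal sketch. The function name and the one-URL-per-line input format are assumptions for illustration; the actual extraction from the dump is not shown:

```shell
#!/bin/sh
# Given one external-link URL per line on stdin, tally how often each
# host is linked and print the hosts in descending order of count.
top_sites() {
  sed -E 's|^[a-z]+://([^/]+).*|\1|' |  # strip scheme and path, keep host
    sort | uniq -c |                    # count occurrences per host
    sort -rn                            # most-linked hosts first
}
```

Feeding it `http://a.com/x`, `http://b.com/y`, and `https://a.com/z` would put `a.com` (count 2) on the first line of output.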

Discussion
A quick observation: the first dump analyses I created were done entirely by hand, and the process was gradually automated. Although a few edits every other month may seem too few to warrant a bot flag, I run the analysis not only for the English Wikipedia but also for over 250 others, and while I am supervising the edits right now, it would be much easier if I could leave this running as a cron job. -- ReyBrujo 23:17, 25 May 2007 (UTC)
 * And yes, I know the bot will need bot flags on the other Wikipedias as well; however, I prefer to get it the bot flag here before requesting the flag at Meta or from any other community. -- ReyBrujo 23:20, 25 May 2007 (UTC)
 * You wrote WHAT? With WHAT? Cool. Since it's only editing your userspace, trial approved for however long you want/need. --ST47 Talk 01:06, 26 May 2007 (UTC)
 * Well, at some point he is going to need a flag to do 20 edits per minute; just thought I'd note that. ——  Eagle 101 Need help? 12:12, 27 May 2007 (UTC)
 * Yes - can we cut the edit rate to about 10/15 epm? Also, any chance of seeing the code (more out of curiosity... :P). Martinp23 12:17, 27 May 2007 (UTC)
 * Hehehe, actually I am guessing it will never write more than 10 pages per minute. However, some Wikipedias (like ar.wikipedia.org) state they only grant bot status to external bots if they have such a flag on the English Wikipedia too. I am also hoping the bot will eventually tag pages catalogued as very probable spambot pages for speedy deletion (after creating a whitelist). -- ReyBrujo 15:29, 27 May 2007 (UTC)
 * Oh, and yes, the [extremely poor] code is available for review (will clean it up a little first!). -- ReyBrujo 15:31, 27 May 2007 (UTC)

OK - let us know when you're happy that trials are complete. For this approval, do you just want to focus on the list generation, and reapply later for the CSD tagging, or would you prefer to extend the trial and do both? Martinp23 12:33, 2 June 2007 (UTC)
 * For now, only the list generation. Unfortunately, I can only try once every couple of months, unless you count the trials on other wikis. I was just notified the bot got approved at an:. I have downloaded the current dump of en; hopefully the processing will be finished in three days. If the run goes fine (counting the runs on all the other wikis), I think it should be safe (after all, it only modifies my userspace). -- ReyBrujo 16:38, 2 June 2007 (UTC)


 * Approved. --ST47 Talk 14:16, 25 June 2007 (UTC)


 * The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.