Wikipedia:Bots/Requests for approval/WalkingSoulBot


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.

Operator:

Time filed: 19:49, Thursday May 19, 2011 (UTC)

Automatic or Manual: Manual

Programming language(s): PHP

Source code available:

Function overview: Research / educational purposes on the distribution of word count and interwiki links over a period of time.

Links to relevant discussions (where appropriate): Bachelorproject Computer Science @ VU (Free University of Amsterdam)

Edit period(s): None

Estimated number of pages affected: None

Exclusion compliant (Y/N): N

Already has a bot flag (Y/N): N

Function details: The bot will crawl random articles to gather the number of words / interwiki links / categories / languages over a period of time.

Discussion
Bots are typically not allowed to download large portions of Wikipedia, instead they should use a database dump. Would that fit your needs? — HELL KNOWZ  ▎TALK 19:53, 19 May 2011 (UTC)


 * The amount of data the bot would download would be minimal, since most of the calculation is done through the API. From my testing, the figures can be retrieved through:


 * api.php?action=query&prop=links&pllimit=5000&titles=Article
 * api.php?action=query&prop=langlinks&lllimit=500&titles=Article
 * api.php?action=query&list=search&srprop=wordcount&srlimit=1&srsearch=Article
 * api.php?action=query&prop=categories&cllimit=500&clshow=!hidden&titles=Article

Each of these returns only a small portion of data, in either XML or plain-text format.
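The four queries above can be assembled programmatically. A minimal PHP sketch is below; the article title "Example" and the helper name `buildQuery` are placeholders, not part of the bot's actual source.

```php
<?php
// Build the four read-only query URLs listed above against the
// English Wikipedia API. format=xml matches the formats mentioned.
function buildQuery(array $params) {
    $endpoint = 'https://en.wikipedia.org/w/api.php';
    $params += array('action' => 'query', 'format' => 'xml');
    return $endpoint . '?' . http_build_query($params);
}

$title = 'Example'; // placeholder article title
$urls = array(
    'links'      => buildQuery(array('prop' => 'links', 'pllimit' => 5000, 'titles' => $title)),
    'langlinks'  => buildQuery(array('prop' => 'langlinks', 'lllimit' => 500, 'titles' => $title)),
    'wordcount'  => buildQuery(array('list' => 'search', 'srprop' => 'wordcount',
                                     'srlimit' => 1, 'srsearch' => $title)),
    'categories' => buildQuery(array('prop' => 'categories', 'cllimit' => 500,
                                     'clshow' => '!hidden', 'titles' => $title)),
);
// Each URL can then be fetched with file_get_contents() or cURL and
// parsed with SimpleXML, e.g.:
//   $xml = simplexml_load_string(file_get_contents($urls['links']));
```

Note that `http_build_query` URL-encodes special characters, so `clshow=!hidden` appears as `clshow=%21hidden` in the final URL.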

The dumps I have come across are mostly of large proportions. For the purposes of my research I would only need a smaller amount. WalkingSoul (talk) 20:07, 19 May 2011 (UTC)


 * How small? Headbomb {talk / contribs / physics / books} 18:03, 20 May 2011 (UTC)
 * Until I have spoken with my research group I cannot give a definite amount. Revisions could be sampled in steps of 5 or larger, so the bot would not have to go through every revision of an article, which would also improve performance. It could be around 10k or more. This would be in the same order as this article. A starting point is to study the distribution of word count versus internal wiki links. The ultimate goal of this research is to gain insight into the development of an article over a period of time in terms of word count versus internal links. WalkingSoul (talk) 18:15, 20 May 2011 (UTC)


 * I don't have access to that article. BTW, there is Special:Export (see Help:Export) which might (or might not) be useful to you. Headbomb {talk / contribs / physics / books} 18:20, 20 May 2011 (UTC)
 * Fixed the url. Either way, I have used Special:Export and made a Wikipedia 'parser' which counts all of the values mentioned above. However, those counts are only an estimate of the real values; the API is more accurate and has them pre-calculated.

Make sure you use reasonable delays between consecutive API calls and avoid parallel calls unless necessary to conserve the bandwidth. Also make sure to observe "maxlag=5" (WP:Maxlag). Otherwise, I see no problems with a read-only analysis bot for research/education. — HELL KNOWZ  ▎TALK 08:41, 22 May 2011 (UTC)
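The delay and maxlag advice above can be sketched as a small PHP wrapper. This is an illustrative sketch, not the bot's actual code; `addMaxlag` and `politeFetch` are hypothetical helper names, and a production bot would also detect a maxlag error response and retry after waiting.

```php
<?php
// Append maxlag=5 so the server refuses the request when replication
// lag is high (per WP:Maxlag), handling URLs with or without a query string.
function addMaxlag($url) {
    return $url . (strpos($url, '?') === false ? '?' : '&') . 'maxlag=5';
}

// Fetch one URL sequentially with a fixed pause afterwards, so
// consecutive API calls are spaced out and never run in parallel.
function politeFetch($url, $delaySeconds = 2) {
    $body = file_get_contents(addMaxlag($url)); // retry-on-maxlag-error omitted for brevity
    sleep($delaySeconds);
    return $body;
}
```

Keeping the calls sequential with a couple of seconds between them is usually enough for a small read-only research bot.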


 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.