Wikipedia:Bots/Requests for approval/usyd-schwa-querybot


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Request Expired.

usyd-schwa-querybot
Operator:

Automatic or Manually assisted: Automatic

Programming language(s): Python

Source code available: uses python-wikitools

Function overview: Bot that will use only the Wikipedia API to retrieve full revision histories for a university research project.

Links to relevant discussions (where appropriate):

Edit period(s): No Editing

Estimated number of pages affected: No Editing

Exclusion compliant (Y/N): Not relevant. API Querying only.

Already has a bot flag (Y/N): N

Function details: The bot will simply query the API and retrieve page metadata as well as revision histories (including content). The information will be used for a research project at the University of Sydney. Results from queries will be cached locally for some time to reduce the number of queries required.

The bot will NOT make edits or scrape pages.

This request is for the bot flag (and hence higher API limits, especially for revision queries).
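The query-and-cache behaviour described above could be sketched roughly as follows. This is a minimal illustration and not the operator's actual code: it uses only the Python standard library rather than python-wikitools, and the names `http_get_json`, `fetch_revisions`, `cached_revisions`, the User-Agent string, and the one-JSON-file-per-title cache layout are all assumptions made for the example. It also assumes the current `continue`-style API continuation and omits the cache expiry mentioned in the request.

```python
import json
import os
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"  # assumed endpoint


def http_get_json(params):
    """Send one GET request to the API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Hypothetical User-Agent; a real client should identify its operator.
    req = urllib.request.Request(
        url, headers={"User-Agent": "usyd-schwa-querybot-sketch/0.1"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def fetch_revisions(title, get_json=http_get_json):
    """Collect revision metadata for one page, following continuation.

    Returns (revisions, number_of_requests_made).
    """
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|comment",  # metadata only, no content
        "rvlimit": "max",  # server picks the cap: 500, or 5000 with higher limits
    }
    revisions = []
    requests_made = 0
    while True:
        data = get_json(params)
        requests_made += 1
        for page in data["query"]["pages"].values():
            revisions.extend(page.get("revisions", []))
        if "continue" not in data:
            return revisions, requests_made
        # Merge the continuation values into the next request.
        params = {**params, **data["continue"]}


def cached_revisions(title, cache_dir, get_json=http_get_json):
    """Return revisions from the local cache, querying the API only on a miss."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, urllib.parse.quote(title, safe="") + ".json")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    revisions, _ = fetch_revisions(title, get_json)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(revisions, fh)
    return revisions
```

Passing `get_json` as a parameter keeps the pagination logic separate from the network call, so the loop can be exercised offline with canned responses.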

Discussion
I should also mention that we have considered using dumps, but that doesn't seem practical at the moment. We require full revision information including content, so we would have to obtain, decompress, and load into a database a full dump (and there haven't been any recent ones). We're also only sparsely querying wiki articles (that is, we're not planning to parse all revisions of all articles), so the overhead of setting up the database seems wasteful. The idea of this bot is to retrieve the articles and histories we need and then cache them locally for further use.

The API with higher limits seems like the best option. Or are there more appropriate methods? Bzho3255 (talk) 00:35, 8 April 2010 (UTC)
 * Very sorry for speedily denying this; I did not read carefully enough. How many pages will you be pulling down? If the number is large enough, I can help you with an XML parser, which I already have downloaded. Otherwise, you don't need a bot account to query the database for full revision histories of pages. In fact, I suggest taking a look at Special:Export, which will probably be easier for you. Tim1357 (talk) 23:10, 8 April 2010 (UTC)


 * The current problem is that downloading all the metadata for the revisions of the 9/11 page, for instance, takes 2 minutes (23 queries). The amount of data being retrieved is in fact very small (just metadata so far, not content), but the overhead of 23 queries is quite large. Raising the limit with a bot flag will reduce this to 3 queries. The export page also limits how deep you can retrieve histories, and I'd like to retrieve full histories. I already have working code to do all of this; I'd just like a bot account for higher API limits. Thanks for any assistance. Bzho3255 (talk) 23:40, 8 April 2010 (UTC)


 * Last month isn't a recent enough dump? Q  T C 23:50, 8 April 2010 (UTC)


 * I was under the impression that the English Wikipedia did not produce full dumps anymore (wasn't it broken a while back?). In any case, I'd still rather avoid expanding terabytes worth of data. Bzho3255 (talk) 03:56, 9 April 2010 (UTC)


 * It would depend on how many pages you need the full history for. If you only need a few, then using the API should be fine. But if you need more than a few hundred average-sized pages, or you're specifically interested in pages like 9/11 that have thousands of edits, then you should really use the dump. Mr.Z-man 19:14, 9 April 2010 (UTC)


 * For the moment, we only need 100 pages, say. But this will change in the future. The plan is definitely to expand a dump if we ever need to scale up to thousands. But currently, we're just experimenting and querying the API seems like the best option. Not having the bot flag, however, makes querying painfully slow. Bzho3255 (talk) 04:49, 10 April 2010 (UTC)


 * You're not going to see much of a speed improvement if you get the bot flag. The bot flag allows bigger clumps to come down in each hit, but that size increase isn't huge.  There may be optimizations you could perform to improve performance.  Josh Parris 13:09, 10 April 2010 (UTC)


 * Querying revisions can be done 500 at a time without the bot flag and 5000 at a time with it. As I've previously mentioned, this is a reduction from 23 queries to 3 for a large article like 9/11. The overhead of querying the API appears to be the bottleneck, as the amount of data across the 23 queries is only a few megabytes but still takes 2 minutes. Bzho3255 (talk) 04:57, 11 April 2010 (UTC)


 * A full dump was recently produced for the first time in 4 years; compressed it's 32 GB, and it expands out to 5+ TB. I'm not sure you want that.
 * Would an editing-denied bot account be appropriate for your uses? Josh Parris 05:02, 5 May 2010 (UTC)
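As a rough check of the request counts quoted in this thread (500 revisions per request without the bot flag, 5000 with it, and 23 requests for the 9/11 page), the arithmetic can be sketched as below. The revision count of 11,500 is a hypothetical figure chosen to be consistent with 23 requests at 500 each; the actual count is not stated in the discussion. The per-request time is simply the reported 2 minutes divided by 23, and it would not necessarily stay constant for larger responses.

```python
import math


def requests_needed(revision_count, per_request_limit):
    """Number of API requests needed to page through a revision history."""
    return math.ceil(revision_count / per_request_limit)


# Hypothetical revision count, consistent with 23 requests at 500 per request.
ASSUMED_REVISIONS = 11_500

without_flag = requests_needed(ASSUMED_REVISIONS, 500)   # 23
with_flag = requests_needed(ASSUMED_REVISIONS, 5000)     # 3

# Average time per request implied by "2 minutes for 23 queries".
seconds_per_request = 120 / without_flag

print(f"without bot flag: {without_flag} requests")
print(f"with bot flag:    {with_flag} requests")
print(f"observed average: {seconds_per_request:.1f} s per request")
```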

Are you still intending to proceed with this request? Josh Parris 09:52, 15 May 2010 (UTC)

Request Expired. Josh Parris 11:11, 25 May 2010 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.