Wikipedia:Bots/Requests for approval/openZIM


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Denied.

openZIM
Operator:

Time filed: 20:00, Sunday November 21, 2010 (UTC)

Automatic or Manually assisted: Automatic

Programming language(s): C++

Source code available: http://svn.openzim.org/svnroot/trunk/

Function overview: reads all Wikipedia articles via the MediaWiki API and creates a ZIM file from them; for details about the openZIM project and the ZIM file format, see http://openzim.org/

Links to relevant discussions (where appropriate):

Edit period(s): a few times per year

Estimated number of pages affected: none

Exclusion compliant (Y/N): N

Already has a bot flag (Y/N): N

Function details: The bot "wikizim" runs on openzim.org and reads, via the MediaWiki API, the list of all articles in the main namespace; it then retrieves the rendered HTML of all listed articles. The contents are written into a ZIM file, which can be downloaded from http://openzim.org/ and used with any ZIM reader on many platforms.

As the contents are retrieved already rendered, it is not possible for the bot to honour the bot exclusion markup. That should not be a problem, as only the main namespace is read.

As the list of all articles is queried, the API limit should be raised, since this requires fewer requests and thus saves server resources.

The wikizim software does not edit any pages. Its user agent is "wikizim".

Discussion
Trying to query all articles via the API or via a web crawler is not allowed, and you would end up having your IP blocked from all Wikimedia servers by the sysadmins. Why not use a database dump (once they're fixed)? There was also talk at one time of Wikimedia generating ZIM-format dumps alongside the normal XML-format dumps; have you asked about this on Wikitech-l? The WP:1.0 project, whose goal is to publish sets of quality Wikipedia articles, might also be interested in putting together a ZIM file with their data. Anomie⚔ 20:12, 21 November 2010 (UTC)
 * We are in touch with the Foundation about integrating ZIM into the dumps, but this will still take a lot of time, as other dumps are currently broken and need to be fixed first. Also, there has been no code yet that the Foundation could use to make ZIM dumps; wikizim should fill that gap.
 * Wikipedia 1.0 actually uses openZIM and currently relies on a quite complicated process to create ZIM files, one that is not usable by anyone else. That is why we created wikizim, which can create ZIM files directly from the API.
 * Last but not least, this is also a test case. We have successfully created ZIM files from smaller wikis, but we need to test it on a big wiki. We are not going to create ZIM files regularly, as we hope that the Foundation will take over that task sooner or later. We just need a live test on a live wiki. --Manuel Schneider(bla) (+/-) 20:27, 21 November 2010 (UTC)
 * While I appreciate that you want to test with a big wiki, we cannot approve such a task. Per WP:BOTPOL: "Bots that download substantial portions of Wikipedia's content by requesting many individual pages are not permitted." Your best bet would be to find someone who can supply you with a copy of the database dump and load it into a local MediaWiki installation for testing. Or you could try getting permission from the sysadmins, but I wouldn't hold my breath on that one. Anomie⚔ 20:56, 21 November 2010 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.