User:Sj/wp-api

Wikisnap API v0.1

How to create a Wikipedia snapshot config file:

Basic idea
To generate the list of articles that go into the snapshot, we read the wiki markup of a 'config' wiki page and ignore everything that is not in an ordered or unordered list. This allows liberal commenting, and lets the config sit on top of whatever other formatting the wiki page uses.
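The basic idea can be sketched in a few lines of Python. This is only an illustration, not the snapshot script itself: the function name extract_list_items is made up here, and it assumes raw wiki markup in which list items start with '*' (unordered) or '#' (ordered).

```python
import re

# Assumption: a list item is any line whose first non-blank character is
# '*' (unordered) or '#' (ordered); nested markers such as '**' are accepted.
LIST_ITEM = re.compile(r"^\s*[*#]+\s*(.*\S)\s*$")

def extract_list_items(wikitext):
    """Return the text of every list item, skipping all other lines."""
    items = []
    for line in wikitext.splitlines():
        match = LIST_ITEM.match(line)
        if match:
            items.append(match.group(1))
    return items

sample = """\
== Snapshot articles ==
Free-form commentary like this line is skipped.
* Physics
* Marshland
# Wikipedia
"""
print(extract_list_items(sample))  # ['Physics', 'Marshland', 'Wikipedia']
```

Headings, paragraphs and any other markup simply fall through the filter, which is what makes commenting on the config page free.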

Interpreted markup

Image blacklist
A description/comment like this line is ignored. The last item in the list below would kill "chart.png".
 * two_MB_of_grass_growing.gif
 * bigbad.gif
 * substitute_for_bigbad.gif
 * understudy_substitute_for_bigbad.gif
 * [[Media:oblique_hacker_culture_joke_pic.jpg]]
 * Marshland in America

This format would work for any media type.
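A rough sketch of how such a blacklist page might be read, assuming raw wiki markup. The names parse_blacklist and MEDIA_LINK are hypothetical, and the link pattern assumes MediaWiki-style [[Media:name]] or [[Image:name|caption]] links; items that are neither a bare file name nor such a link are kept as written.

```python
import re

# Hypothetical pattern: [[Media:name]] or [[Image:name|caption]].
MEDIA_LINK = re.compile(r"\[\[(?:Media|Image):([^|\]]+)(?:\|[^\]]*)?\]\]")

def parse_blacklist(wikitext):
    """Collect blacklisted media names from the list items of a config page."""
    names = set()
    for line in wikitext.splitlines():
        stripped = line.strip()
        if not stripped.startswith(("*", "#")):
            continue  # descriptions and comments are ignored
        item = stripped.lstrip("*#").strip()
        if not item:
            continue  # an empty bullet carries no name
        link = MEDIA_LINK.fullmatch(item)
        # For a link, keep only the target name; otherwise keep the item text.
        names.add(link.group(1).strip() if link else item)
    return names

page = """\
A description/comment, ignored
* bigbad.gif
* [[Media:oblique_hacker_culture_joke_pic.jpg]]
"""
blacklist = parse_blacklist(page)
print("bigbad.gif" in blacklist)                           # True
print("oblique_hacker_culture_joke_pic.jpg" in blacklist)  # True
```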

i18n page

 * en word or phrase
 * es palabra o frase

Each list (as separated by paragraph/double newlines) describes localizations for a word/phrase. Phrases will be matched before substrings. There is no key language to each list, although using a consistent language as the first entry makes sense as a means of alphabetically organizing the lists on a page.

Having an explicit page like this avoids interlanguage link ambiguities, or inconsistencies between the one-way links between two languages. Such a page could easily be seeded by a bot from a list on one language, and tweaked. Note: this page will have ~one line for every article in the all-language snapshot.
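A minimal sketch of reading such a page, under stated assumptions: each item is a language code followed by a phrase, lists are separated by a blank line, and items that carry only a language code are skipped. The names parse_i18n and phrases_longest_first are invented for illustration. Sorting by length is one simple way to get "phrases before substrings"; the real script may do it differently.

```python
def parse_i18n(wikitext):
    """Return one {language_code: phrase} dict per blank-line-separated list."""
    groups, current = [], {}
    for line in wikitext.splitlines():
        stripped = line.strip()
        if not stripped:
            # A blank line closes the current list, if there is one.
            if current:
                groups.append(current)
                current = {}
            continue
        if not stripped.startswith(("*", "#")):
            continue  # non-list text is commentary
        parts = stripped.lstrip("*#").strip().split(None, 1)
        if len(parts) == 2:  # assumption: skip items with no phrase
            lang, phrase = parts
            current[lang] = phrase
    if current:
        groups.append(current)
    return groups

def phrases_longest_first(groups, lang):
    """Order one language's phrases so full phrases are tried before substrings."""
    phrases = [group[lang] for group in groups if lang in group]
    return sorted(phrases, key=len, reverse=True)

page = """\
* en Wikipedia
* pt Wikipédia

* en Wikipedia, the free encyclopedia
* es Wikipedia, la enciclopedia libre
"""
groups = parse_i18n(page)
print(phrases_longest_first(groups, "en"))
```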

examples
Ex. 1
 * en Wikipedia
 * simple Wikipedia
 * pt Wikipédia

is preferred to

 * en Wikipedia
 * simple Wikipedia
 * pt Wikipédia

though both are equally valid.

Ex. 2
 * es Wikipedia, la enciclopedia libre
 * en Wikipedia, the free encyclopedia
 * Italics as a flag for case-sensitive/exact matches


 * ar
 * th

Ex. 3 (equiv to)
 * 1) en disambiguation
 * 2)  pt desambigua%C3%A7%C3%A3o
 * 1) pt desambiguação

While MediaWiki creatively interprets the markup (restarting the list at 1), the above looks fine from our script's point of view.
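One way to treat the percent-encoded and the literal entries of Ex. 3 as equivalent is to decode percent-escapes before comparing. The sketch below uses urllib.parse.unquote from the Python standard library, which decodes as UTF-8 by default; normalize_phrase is a hypothetical helper name, and whether the script should normalize this way is an assumption.

```python
from urllib.parse import unquote

def normalize_phrase(phrase):
    """Decode percent-escapes (as UTF-8) so both spellings compare equal."""
    return unquote(phrase)

encoded = "desambigua%C3%A7%C3%A3o"
literal = "desambiguação"
print(normalize_phrase(encoded) == normalize_phrase(literal))  # True
```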

Ex. 4 (to blacklist the Thai page)
 * 1) en Physics
 * 2) th none
 * 1) es Fisica
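The 'none' marker in Ex. 4 could be handled roughly as follows. This is a sketch under assumptions: page_status is an invented helper, the three status strings are made up, the case-insensitive comparison is a guess, and a language that does not appear in the list is simply treated as unlisted.

```python
BLACKLISTED = "none"  # marker taken from Ex. 4

def page_status(group, lang):
    """Classify one language within a single i18n list (hypothetical helper)."""
    phrase = group.get(lang)
    if phrase is None:
        return "unlisted"      # the list says nothing about this language
    if phrase.strip().lower() == BLACKLISTED:
        return "blacklisted"   # explicitly excluded from the snapshot
    return "included"

group = {"en": "Physics", "th": "none", "es": "Fisica"}
print(page_status(group, "th"))  # blacklisted
print(page_status(group, "en"))  # included
print(page_status(group, "ar"))  # unlisted
```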

Other needed pages
We need pages for:
 * 1) MediaWiki verbiage (e.g. 'navigation', 'search')
 * 2) Common in-article verbiage (e.g. 'See also', 'External links')
 * 3) Our verbiage (e.g. 'OLPC Digital Library')
 * 4) Our index page article category headers
 * 5) Header/footer text and formatting, other envelope text, and page design/css per language