User:Bo Lindbergh/dabalyze

This is a Perl script for finding links to disambiguation pages in Wikipedia by analyzing database dumps in the new XML format. It may not work properly in a non-Unix environment. Save it as "dabalyze" in a convenient directory. Instructions follow below.


 * input
 * The script expects to find the file "pages_current.xml" in the current directory. You can get this by downloading and uncompressing http://download.wikimedia.org/wikipedia/en/pages_current.xml.bz2
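The download-and-uncompress step can be sketched in the shell. This is an illustration, not part of the script: it fakes a tiny pages_current.xml to demonstrate the bzip2 round trip, since in real use you would first fetch pages_current.xml.bz2 from the URL above (for example with wget or curl) and then run only the bunzip2 line.

```shell
# Illustration only: create a stand-in for the real downloaded dump.
printf '<mediawiki></mediawiki>\n' > pages_current.xml
bzip2 -f pages_current.xml             # produces pages_current.xml.bz2
# This is the step dabalyze actually needs before it runs:
bunzip2 -f pages_current.xml.bz2       # restores pages_current.xml
cat pages_current.xml
```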


 * output
 * The script generates two text files named "articles.txt" and "templates.txt" in the current directory. The first one contains a list of disambiguation pages linked to by articles, suitable for inclusion in Disambiguation pages with links. The second one contains a list of disambiguation pages linked to by templates; this is intended for a hypothetical sub-project concentrating on the template namespace. Note that the files use UTF-8 encoding; any text editor you use for copying and pasting into Wikipedia must be able to handle that.
 * Since the script has to handle circular redirects anyway, it generates a list of them in the file "circular.txt".
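To separate article-space from template-space links, the script has to bucket pages by namespace while streaming through the dump. A minimal sketch of that idea in Python (not Bo Lindbergh's Perl; the sample XML, function name, and namespace test are all illustrative assumptions):

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a MediaWiki XML export.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page><title>Mercury</title><revision><text>{{disambig}}</text></revision></page>
  <page><title>Template:Otheruses</title><revision><text>[[Mercury]]</text></revision></page>
</mediawiki>"""

NS = "{http://www.mediawiki.org/xml/export-0.3/}"

def titles_by_namespace(stream):
    buckets = {"article": [], "template": []}
    # iterparse streams the dump so the whole file never sits in memory.
    for _event, elem in ET.iterparse(stream):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            key = "template" if title.startswith("Template:") else "article"
            buckets[key].append(title)
            elem.clear()  # free the finished <page> subtree
    return buckets

buckets = titles_by_namespace(io.StringIO(SAMPLE))
print(buckets["article"])   # ['Mercury']
print(buckets["template"])  # ['Template:Otheruses']
```

A real run would pass an open file handle for pages_current.xml instead of the in-memory sample; the streaming approach is what keeps a multi-gigabyte dump tractable.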


 * diagnostics
 * A successful run generates diagnostic output similar to the following:

 Analysis pass 1
   41868 total disambiguation pages
 Analysis pass 2
   30385 linked disambiguation pages
   30369 in the article namespace
     880 in the template namespace
 Report generation
     514 entries written to articles.txt
     880 entries written to templates.txt
     100 entries written to circular.txt

The total running time is about 32 minutes on an 867 MHz PowerPC G4 (based on the database dump dated 2005-10-20).

ca:Viquipèdia:Enllaços incorrectes a pàgines de desambiguació/Scripts
fr:Projet:Liens vers les pages d'homonymie/Scripts