Wikipedia:Reference desk/Archives/Computing/Early/wpfsck

Wpfsck is an application written by Triddle and Andrew Rodland which scans the English Wikipedia for errors and inconsistencies. The program is written in Perl and takes its name from the Unix fsck utility. Currently the program can generate reports for WikiProject stubsensor, Most wanted stubs, and Multiple redirects in about 40 minutes on an 800 MHz PowerPC G4.

At its core, wpfsck is an extensible architecture built around the concept of cleanup projects and designed specifically with Wikipedia in mind. Because of this, additional cleanup projects can be added easily, and adding them is encouraged. If you have an idea for a systematic cleanup project, please leave a note in the Comments section. If you currently run a cleanup project, you may wish to consider consolidating with this project; see Consolidation.

Cleanup projects

 * These reports were generated from the database dump as of Jun 23, 2005.

Stubsensor
The Stubsensor project attempts to programmatically identify articles that have grown beyond a stub but still carry their stub tag. The version of Stubsensor in wpfsck adds statistical analysis and Bayesian filtering techniques to identify the offending stubs. Notably, this new Stubsensor identified articles that the original Stubsensor missed, even when run against the same database dump, which shows a lot of promise for the technique. The top 10 stubs from this report are:


 * List_of_Illinois_State_Routes
 * Guarani
 * Clementi,_Singapore
 * History_of_democracy
 * Buenos_Aires_Province
 * Naha,_Okinawa
 * Argentia,_Newfoundland_and_Labrador
 * Hugo_Award_for_Best_Professional_Artist
 * Hugo_Award_for_Best_Professional_Editor
 * Panathinaikos
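
The page does not describe how wpfsck's Bayesian filtering actually works, so the following is only a minimal sketch of the general idea, using a generic naive Bayes classifier with add-one smoothing. The feature names, the thresholds in `features`, and the `"stub"` / `"grown"` labels are all assumptions made for illustration, not details taken from wpfsck.

```python
import math
from collections import defaultdict


def features(wikitext):
    """Hypothetical boolean features; the thresholds are illustrative guesses."""
    return {
        "long": len(wikitext.split()) > 300,
        "has_sections": "\n==" in wikitext,
        "many_links": wikitext.count("[[") > 10,
    }


class NaiveBayes:
    """Naive Bayes over boolean features with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(int)  # (label, name, value) -> count

    def train(self, feats, label):
        self.class_counts[label] += 1
        for name, value in feats.items():
            self.feature_counts[(label, name, value)] += 1

    def log_score(self, feats, label):
        total = sum(self.class_counts.values())
        n = self.class_counts[label]
        # Smoothed log prior for the class.
        score = math.log((n + 1) / (total + len(self.class_counts)))
        for name, value in feats.items():
            count = self.feature_counts[(label, name, value)]
            # Each feature is boolean, so smooth over two possible values.
            score += math.log((count + 1) / (n + 2))
        return score

    def classify(self, feats):
        # Pick the label with the highest log score.
        return max(self.class_counts, key=lambda label: self.log_score(feats, label))
```

Under these assumptions, an article that still carries a stub tag but is classified as `"grown"` would be a candidate for the report. The training examples would have to come from articles already labeled by hand.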

Double redirects
Double redirects occur frequently but are easy to detect and fix.


 * List_of_solo_piano_pieces,_Contemporary
 * Southern_Nootkan
 * FreeCraft
 * Zec.
 * Hovedside
 * Trichloroethane
 * Pre-Indo-European_Europe
 * Razzie_Award_Worst_Actress
 * Malegueta_pepper
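
As a rough illustration of why double redirects are easy to detect and fix, the sketch below assumes the redirects have already been extracted from a dump into a dictionary mapping each redirect title to its target. The function names and the dictionary layout are hypothetical and are not taken from wpfsck.

```python
def find_double_redirects(redirects):
    """Return {source: (middle, next_target)} for redirects that point at another redirect.

    `redirects` is an assumed mapping of redirect page title -> target title.
    """
    doubles = {}
    for source, target in redirects.items():
        if target in redirects and target != source:
            doubles[source] = (target, redirects[target])
    return doubles


def resolve(redirects, title):
    """Follow a redirect chain to its end, stopping if a loop is detected."""
    seen = set()
    while title in redirects and title not in seen:
        seen.add(title)
        title = redirects[title]
    return title


# Hypothetical example data: A -> B -> C is a double redirect, D -> C is not.
example = {"A": "B", "B": "C", "D": "C"}
doubles = find_double_redirects(example)
fixed_target = resolve(example, "A")
```

Fixing a double redirect then amounts to editing the source page so that it points directly at the title returned by `resolve`.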

Most wanted stubs
The Most wanted stubs report lists the stubs with the highest number of incoming links, ordered with the most linked stub at the top. Here are the top 10 as generated by wpfsck, each followed by its link count:


 * Depth 1244
 * Cincinnati/Northern_Kentucky_metropolitan_area 254
 * Inverclyde_Line,_Glasgow 125
 * Durham_Coast_Line 120
 * Graphics 118
 * Stepanakert 109
 * Synthetic_radioisotope 109
 * Sun_Coast 105
 * Natchez_District 98
 * Khojaly 97
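
The ranking above can be sketched as a simple count of incoming links. This assumes the link table is available as (source, target) title pairs and that the set of stub titles is already known. Both inputs, and the function name, are hypothetical, and whether wpfsck counts repeated links from the same page is not stated here.

```python
from collections import Counter


def most_wanted_stubs(links, stubs, top=10):
    """Rank stub titles by number of incoming links, most linked first.

    `links` is an assumed iterable of (source_title, target_title) pairs;
    `stubs` is an assumed set of titles carrying a stub tag.
    """
    counts = Counter(target for _source, target in links if target in stubs)
    return counts.most_common(top)


# Hypothetical example data.
example_links = [("X", "Depth"), ("Y", "Depth"), ("Z", "Graphics"), ("X", "Other")]
example_stubs = {"Depth", "Graphics"}
ranking = most_wanted_stubs(example_links, example_stubs)
```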

Consolidation
Consolidation of cleanup projects may make sense in some circumstances:
 * If you are performing cleanup projects via SQL and it is very time-consuming, then wpfsck can likely do it faster.
 * If you download a database dump only to perform a cleanup project, then consolidating eliminates needless bandwidth consumption.
 * Wpfsck will soon feature automatic cleanup project publishing, so if you don't wish to perform that task manually, I can have wpfsck perform it for you.

Even if you don't want to consolidate, you may find Parse::MediaWikiDump, the Perl module at the heart of wpfsck, useful. You may also wish to run your own copy of wpfsck if you perform many cleanup projects.

Comments

 * Please feel free to leave your comments, ideas, and suggestions for new cleanup projects.


 * It would be useful to have the source available here. -- Beland 01:36, 5 October 2005 (UTC)
 * I can definitely make the source available, but honestly it's not a great program. The modularity is a hack (everything is a hack, really), but it does work. It also needs to be updated to the new dump file format, which means porting to Parse::MediaWikiDump and an iterator interface instead of a callback interface to the dump file. I had started on a complete rewrite of wpfsck, but I've since been dragged into school and work and I've got no idea when I'll have enough free time to complete the project. Oh yeah, there is no documentation, and I really don't have the time to document it properly; let me know if you would still like the old source code. Triddle 15:52, 5 October 2005 (UTC)
 * The source code would still be good. It allows others to get in on the act. --*Wilfred* (talk) 20:09, 9 March 2006 (UTC)
 * Okie dokie, I created a tarball of the source code and put it at http://tylerriddle.com/wpfsck-0.01.tar.gz - you can contact me (my info is in the README) if you would like some help sorting through the code. I hope it is useful and it works out that someone can bring it back into working order. :-) Triddle 15:50, 10 March 2006 (UTC)
 * The source code is lost - does anyone have an archive of it? Triddle 00:31, 10 November 2006 (UTC)