Wikipedia:WikiProject Disambiguation/Database dump analysis

A database dump is a backup of all Wikipedia pages that can be downloaded. Once downloaded, extensive analysis can be performed on the dump. The same analysis can't be done by scraping the live servers, because that creates excessive load.

Database dump analysis can help WikiProject Disambiguation achieve its goals by providing editors with extra information.
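As a rough illustration of working from a downloaded dump rather than the live servers, the sketch below streams pages out of a MediaWiki XML export one at a time. It assumes the usual export layout (`<page>` elements containing `<title>` and `<revision><text>`); the helper names `local_name` and `iter_pages` and the in-memory `SAMPLE` are made up for this example, and a real run would open the downloaded dump file instead.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) for every <page> in a MediaWiki XML export."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for node in elem.iter():
            name = local_name(node.tag)
            if name == "title":
                title = node.text
            elif name == "text":
                text = node.text or ""
        yield title, text
        elem.clear()  # drop the finished page's content to limit memory use


# Tiny stand-in for a real dump file, for illustration only.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Mercury</title>
    <revision><text>'''Mercury''' may refer to: {{disambig}}</text></revision>
  </page>
  <page>
    <title>Talk:Mercury</title>
    <revision><text>#REDIRECT [[Talk:Mercury (planet)]]</text></revision>
  </page>
</mediawiki>"""

pages = dict(iter_pages(io.BytesIO(SAMPLE)))
```

Collecting everything into a `dict` is only sensible for a small sample like this one; on a full dump it is probably better to process each page as it is yielded.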

Currently run dump analyses

 * The Disambiguation pages maintenance report is occasionally refreshed from a dump. Generated by RussBlau.
 * Generated by Bo Lindbergh (details):
 * The Disambiguation pages with links report. Status - updated every couple of months, as needed.
 * From portals, a variation on link repair
 * From templates, a variation on link repair
 * Malplaced disambiguation pages report.
 * Statistics, example:

Proposal: tracking down dab pages with suspect style
At WP:DAB, wangi expressed interest in using the dumps to improve dab page style by tracking down suspect dab pages. One could argue that Category:Disambiguation pages in need of cleanup is always plentifully stocked and that a dump analysis to find more troublesome dabs is unnecessary. Then again, who could have foreseen the activity around From templates that resulted in the completion of that report?

Ideas
Dab pages are checked for:
 * Image and template checks...
 * Images
 * Templates (other than dab templates, naturally; this includes stub templates etc.)
 * Images and templates indicate that a dab page is verging on article status. An expert can examine the dab and merge content, start a discussion, etc.
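The image and template check could be sketched roughly as follows. This is a regex approximation, not a full wikitext parser; the set of dab template names in `DAB_TEMPLATES` is an assumption and would need to be filled in from the actual list, and `suspect_content` is a hypothetical helper name.

```python
import re

# Assumed set of disambiguation template names (lowercase); extend as needed.
DAB_TEMPLATES = {"disambig", "disambiguation", "dab", "hndis", "geodis"}

# [[Image:...]] or [[File:...]]; captures the file name up to the first pipe.
IMAGE_RE = re.compile(r"\[\[\s*(?:Image|File)\s*:\s*([^|\]]+)", re.IGNORECASE)
# {{name}} or {{name|...}}; captures the template name only.
TEMPLATE_RE = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")


def suspect_content(wikitext):
    """Return (images, non_dab_templates) found in a dab page's wikitext."""
    images = [name.strip() for name in IMAGE_RE.findall(wikitext)]
    templates = [
        name for name in TEMPLATE_RE.findall(wikitext)
        if name.lower() not in DAB_TEMPLATES
    ]
    return images, templates


page = (
    "'''Mercury''' may refer to:\n"
    "[[Image:Mercury.jpg|thumb|The planet]]\n"
    "* [[Mercury (planet)]]\n"
    "{{chem-stub}}\n"
    "{{disambig}}\n"
)
images, templates = suspect_content(page)
```

A page for which either list is non-empty would go on the report for an editor to examine.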


 * Talk page is a redirect?
 * If a page has a dab template then it should have its own talk page. Because of page moves, a dab's talk page often redirects elsewhere (no redirect should be present). A listing of dab pages without their own talk pages would be helpful.
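A minimal sketch of the talk page check, assuming the dump has already been reduced to a `{title: wikitext}` mapping and that only main-namespace dab pages matter (so the talk page is simply `"Talk:" + title`). The template names in `DAB_RE` and the function name `dabs_without_own_talk` are assumptions for illustration.

```python
import re

# Assumed dab template names; the real list is longer.
DAB_RE = re.compile(
    r"\{\{\s*(?:disambig|disambiguation|dab)\s*(?:\||\}\})", re.IGNORECASE
)
REDIRECT_RE = re.compile(r"\s*#REDIRECT", re.IGNORECASE)


def dabs_without_own_talk(pages):
    """List dab pages whose talk page is missing or is a redirect."""
    flagged = []
    for title, text in pages.items():
        if title.startswith("Talk:") or not DAB_RE.search(text):
            continue  # not a main-namespace dab page
        talk = pages.get("Talk:" + title)
        if talk is None:
            flagged.append((title, "missing"))
        elif REDIRECT_RE.match(talk):
            flagged.append((title, "redirect"))
    return flagged


pages = {
    "Mercury": "'''Mercury''' may refer to: {{disambig}}",
    "Talk:Mercury": "#REDIRECT [[Talk:Mercury (planet)]]",
    "Apollo": "'''Apollo''' may refer to: {{disambig}}",
    "Talk:Apollo": "Discussion of the Apollo dab page.",
    "Orion": "'''Orion''' may refer to: {{disambig}}",
    "Mercury (planet)": "Mercury is a planet.",
}
report = dabs_without_own_talk(pages)
```

Here `Mercury` is flagged because its talk page redirects and `Orion` because it has no talk page at all, while `Apollo` has its own talk page and is left alone.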


 * Link checking...
 * Check the ratio of wikilinks to the number of lines on the page. The idea is that, generally, the higher the value, the more the page is in need of cleanup.
 * Check for piping of links. Generally piping should not be present on dab pages. Perhaps check the gross number of piped links
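The two link checks could be computed along these lines. This is only a sketch: counting non-blank lines as the denominator is an assumption (the idea above does not say how lines are counted), the regex also counts category and image links, and `link_stats` is a hypothetical name.

```python
import re

# Matches [[target]] and [[target|label]]; findall gives an empty label
# string when the link is not piped.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def link_stats(wikitext):
    """Return link count, non-blank line count, links-per-line ratio, piped count."""
    lines = [ln for ln in wikitext.splitlines() if ln.strip()]
    links = LINK_RE.findall(wikitext)
    piped = sum(1 for _target, label in links if label)
    ratio = len(links) / len(lines) if lines else 0.0
    return {"links": len(links), "lines": len(lines), "ratio": ratio, "piped": piped}


page = (
    "'''Mercury''' may refer to:\n"
    "* [[Mercury (planet)]], the [[planet]] closest to the [[Sun]]\n"
    "* [[Mercury (element)|Mercury]], a chemical element\n"
    "\n"
    "{{disambig}}\n"
)
stats = link_stats(page)
```

Sorting dab pages by `ratio` or by `piped` in descending order would give a candidate cleanup list; any cut-off value would have to be chosen by inspecting real pages.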