User:The Anome/Naive Bayes WikiProject classifier

This is a mini-project to create a Naive Bayes classifier that maps articles to WikiProjects, allowing the auto-classification of articles that have not yet been assigned to any WikiProject.

Relevant links

 * Page category to WikiProject category mappings: https://quarry.wmcloud.org/query/77172 (note: the data takes quite a long time to load, since the browser downloads about 2M lines of table results)
 * PetScan query to validate correlations in the cross-reference table: limited to a sample of articles and a limited number of output lines
 * A note on Laplace smoothing: https://courses.cs.washington.edu/courses/cse446/20wi/Section7/naive-bayes.pdf (see also: Additive smoothing)
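
As a reminder of what Laplace (additive) smoothing does, here is a minimal sketch of the multinomial form. The function name and parameters are illustrative and are not taken from naive_bayes.py; `alpha=1.0` gives classic Laplace smoothing.

```python
def smoothed_probability(count, total, vocabulary_size, alpha=1.0):
    """Additive-smoothing estimate of P(feature | class).

    count:           times the feature was seen with this class
    total:           total feature observations for this class
    vocabulary_size: number of distinct features across all classes
    alpha:           pseudo-count added to every feature (1.0 = Laplace)
    """
    return (count + alpha) / (total + alpha * vocabulary_size)


# A page category never seen with a WikiProject still gets a small
# non-zero probability instead of zeroing out the whole product.
unseen = smoothed_probability(0, 10, 5)
seen = smoothed_probability(3, 10, 5)
```

Without smoothing, a single unseen page category would give a probability of zero and rule out a WikiProject regardless of how well the other categories match.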

Strategy

 * Build the cross-reference table, limited to the first few million rows to avoid overloading the web browser.
 * Download this to a TSV file
 * Load this into a Python program, and build the relevant tables for a Naive Bayes classifier
 * Build a mapping from WikiProject categories to the WikiProject templates
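
The load-and-build steps above could be sketched roughly as follows. This is not the contents of naive_bayes.py: it assumes the TSV has a header row followed by two columns (page category, WikiProject category), one row per observed pairing, and it uses row counts per WikiProject as a stand-in for the class prior. All function names and the sample data are hypothetical.

```python
import csv
import io
import math
from collections import Counter, defaultdict


def read_pairs(handle, has_header=True):
    """Yield (page_category, project_category) pairs from a TSV handle.

    Assumed layout: two tab-separated columns, optional header row.
    """
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)
    for row in reader:
        if len(row) >= 2:
            yield row[0], row[1]


def build_tables(pairs):
    """Build the count tables needed for a multinomial Naive Bayes model."""
    project_totals = Counter()          # rows seen per WikiProject category
    pair_counts = defaultdict(Counter)  # project -> page category -> count
    vocabulary = set()                  # all page categories seen
    for page_cat, project_cat in pairs:
        project_totals[project_cat] += 1
        pair_counts[project_cat][page_cat] += 1
        vocabulary.add(page_cat)
    return project_totals, pair_counts, vocabulary


def classify(page_cats, project_totals, pair_counts, vocabulary, alpha=1.0):
    """Rank WikiProject categories by log-posterior for an article's categories."""
    grand_total = sum(project_totals.values())
    scores = {}
    for project, total in project_totals.items():
        score = math.log(total / grand_total)  # prior from row counts (assumption)
        for cat in page_cats:
            if cat not in vocabulary:
                continue  # ignore categories never seen in the training data
            count = pair_counts[project][cat]
            # Additive (Laplace) smoothing keeps unseen pairings non-zero.
            score += math.log((count + alpha) / (total + alpha * len(vocabulary)))
        scores[project] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up sample standing in for the downloaded TSV file.
SAMPLE_TSV = (
    "page_category\tproject_category\n"
    "Volcanoes\tWikiProject Volcanoes articles\n"
    "Volcanoes\tWikiProject Volcanoes articles\n"
    "Mountains\tWikiProject Volcanoes articles\n"
    "Mountains\tWikiProject Mountains articles\n"
    "Mountain ranges\tWikiProject Mountains articles\n"
    "Rivers\tWikiProject Rivers articles\n"
)

tables = build_tables(read_pairs(io.StringIO(SAMPLE_TSV)))
ranked = classify(["Volcanoes", "Mountains"], *tables)
for project, log_score in ranked:
    print(f"{log_score:8.3f}  {project}")
```

For the real data, `io.StringIO(SAMPLE_TSV)` would be replaced by a file opened with `open(path, newline="", encoding="utf-8")`. Working in log space avoids floating-point underflow when an article has many categories.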

Code

 * User:The Anome/Naive Bayes WikiProject classifier/naive_bayes.py