User:MZMcBride/climax

climax is the code name for a project that gathers and analyzes a set of attributes of biographies of living people in an attempt to programmatically find problematic biographies.

Attributes
A number of attributes of the pages are collected using Python scripts and are inserted into an SQLite database. The database will be released to the public (with the exception of one attribute: the number of page watchers). Below is a table of the raw attributes collected. Other attributes will be derived from this data.
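As a rough illustration of the storage step, here is a minimal sketch using Python's standard sqlite3 module. The table name page_props, the column names, and the helper functions are all assumptions made for this example; the real schema is not given on this page. The only detail taken from the text is that the watcher count is kept out of whatever gets published.

```python
import sqlite3

# Hypothetical schema: the table and column names are assumptions for this sketch.
SCHEMA = """
CREATE TABLE IF NOT EXISTS page_props (
    page_title  TEXT PRIMARY KEY,
    page_length INTEGER,
    bad_words   INTEGER,
    ref_tags    INTEGER,
    ext_links   INTEGER,
    ref_banner  INTEGER,
    watchers    INTEGER
)
"""

# Columns that are safe to publish: everything except the watcher count.
PUBLIC_COLUMNS = (
    "page_title", "page_length", "bad_words",
    "ref_tags", "ext_links", "ref_banner",
)


def create_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def insert_props(conn, props):
    """Insert or overwrite one page's attributes, given as a dict."""
    conn.execute(
        "INSERT OR REPLACE INTO page_props VALUES "
        "(:page_title, :page_length, :bad_words, :ref_tags, "
        ":ext_links, :ref_banner, :watchers)",
        props,
    )


def public_rows(conn):
    """Return every row with the watcher column left out."""
    cols = ", ".join(PUBLIC_COLUMNS)
    query = "SELECT %s FROM page_props ORDER BY page_title" % cols
    return conn.execute(query).fetchall()
```

A public release could then be built from public_rows() alone, so the watcher counts never leave the private copy.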

Analysis
The value of this data lies in its analysis. climax will focus on a scoring system. Other users may be interested in performing their own analysis to examine particular trends or problem areas.
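No scoring chart has been devised yet, so the following is only a sketch of what a scoring system could look like: a weighted sum over the raw attributes, with pages ranked by the result. Every weight here is an invented placeholder, and the attribute names (bad_words, ref_banner, ref_tags, ext_links) are hypothetical keys chosen for this example.

```python
# All weights are invented placeholders, not values from the project.
WEIGHTS = {
    "bad_words": 5.0,    # more flagged words -> more suspect
    "ref_banner": 10.0,  # a reference banner hints at sourcing problems
    "ref_tags": -1.0,    # more <ref> tags -> better sourced
    "ext_links": -0.5,   # more external links -> slightly better sourced
}


def score(props, weights=WEIGHTS):
    """Weighted sum of raw attributes; missing attributes count as 0."""
    return sum(weight * props.get(name, 0) for name, weight in weights.items())


def rank(pages, weights=WEIGHTS):
    """Return page titles sorted from most suspect (highest score) to least."""
    return sorted(pages, key=lambda title: score(pages[title], weights), reverse=True)
```

Keeping the weights in a single dict makes it easy for others to plug in their own values and rerun the ranking against the same data.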

Technical details
Going to split this into a few separate scripts. The dump scanner goes first. Then the remaining props need to be retrieved from a text file (page views) and from the database.

{{collapse bottom}}
 * climax-dump-props.py
   * Implemented
     * Total page length
     * Total number of bad words
     * Total number of "<ref"s
     * Total number of "[http"
     * Presence of reference banners
   * Not implemented
     * Number of bad words within X bytes of {{cn, etc.
     * Number of bad words within X bytes of <ref / [http
 * climax-database-props.py
   * Implemented
     * none
   * Not implemented
     * Date of first edit to page
     * Date of last edit to page
     * Total number of revisions
     * Number of page watchers
 * climax-views-props.py
   * Implemented
     * none
   * Not implemented
     * Number of page views from bh.txt (rename this file...)
 * climax-scorer.py
   * Implemented
     * none
   * Not implemented
     * Need to devise a proper scoring chart
     * over 9000 :PP
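The dump-scanner properties listed above can be sketched roughly as follows. This is not the actual climax-dump-props.py; the function names, the dict keys, and the template names standing in for "reference banners" are assumptions. The bad word patterns are a small sample from the list on this page, matched case-insensitively on the assumption that this mirrors the current behaviour. The proximity count measures characters rather than bytes and looks in both directions, which is a simplification.

```python
import re

# Small sample of patterns from the bad word list; the full list is longer.
BAD_WORD_RE = re.compile(r"\bfired\b|fraud\b|\bscandal\b|\btheft\b", re.IGNORECASE)

# Assumption: these templates stand in for "reference banners".
BANNER_RE = re.compile(
    r"\{\{\s*(?:unreferenced|refimprove|BLP unsourced|BLP sources)\b",
    re.IGNORECASE,
)


def dump_props(wikitext):
    """Collect the raw per-page attributes from one page's wikitext."""
    return {
        "page_length": len(wikitext.encode("utf-8")),  # length in bytes
        "bad_words": len(BAD_WORD_RE.findall(wikitext)),
        "ref_tags": wikitext.count("<ref"),
        "ext_links": wikitext.count("[http"),
        "ref_banner": bool(BANNER_RE.search(wikitext)),
    }


def bad_words_near(wikitext, markers=("<ref", "[http"), window=100):
    """Count bad words starting within `window` characters of any marker."""
    positions = [
        match.start()
        for marker in markers
        for match in re.finditer(re.escape(marker), wikitext)
    ]
    return sum(
        1
        for match in BAD_WORD_RE.finditer(wikitext)
        if any(abs(match.start() - pos) <= window for pos in positions)
    )
```

Note that counting "<ref" does not also count closing "</ref>" tags, since the closing tag contains "</ref" rather than "<ref".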


 * Test cases
   * Michael Austin
   * Frank Reagan
   * Paweł Piskorski

Bad words
Urgently need to add case sensitivity support here.

Words that definitely need case sensitivity:
 * dick
 * evil
 * traitor
 * arrested
 * psycho
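One possible way to add case sensitivity is to keep two pattern groups and compile only one of them with re.IGNORECASE. The sketch below assumes the intent is that lowercase forms count while capitalized forms (names such as "Dick", titles such as "Psycho") do not. The variable and function names are hypothetical, and the case-insensitive group is just a sample of the other patterns on this page.

```python
import re

# Patterns from this page that should only match in lowercase.
CASE_SENSITIVE = [r"\bdick\b", r"\bevil\b", r"\btraitor", r"\barrested\b", r"\bpsycho\b"]

# Sample of the remaining patterns, matched regardless of case.
CASE_INSENSITIVE = [r"fraud\b", r"\bscandal\b", r"\btheft\b"]

SENSITIVE_RE = re.compile("|".join(CASE_SENSITIVE))
INSENSITIVE_RE = re.compile("|".join(CASE_INSENSITIVE), re.IGNORECASE)


def count_bad_words(text):
    """Total matches across both the case-sensitive and case-insensitive groups."""
    return len(SENSITIVE_RE.findall(text)) + len(INSENSITIVE_RE.findall(text))
```

With this split, "Dick Cheney was accused of Fraud." yields one match (the case-insensitive "Fraud"), while "He is a dick and a traitor." yields two.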

Words to possibly remove from the bad word list:
 * steals (lolbaseball)
 * investigations

\babusing\b \babuse\b \babused\b \babducted\b \babduction\b \baccuse\b \baccused\b \baccusation\b \ballege\b \balleged\b \banus\b \barrest\b \barrested\b \barse\b \bass\b assault\b assaulted\b asshole bastard bitch bloody bollocks \bbribe\b \bbribes\b \bbribed\b bugger \bcharges\b child molester child molestor child predator child predater \bcocks\b convict\b convicted\b \bcorrupt\b cunt\b \bdick\b dumbass espionage \bevil\b fag\b faggot\b faggots\b fags\b \bfired\b \bfled\b \bflee\b fraud\b \bfuck\b \bfucks\b \bfucked\b is gay \bghey\b guilty had an affair \bhates idiot \bimpeach insane insanity investigation jackass \bkilled\b \bliar\b \bliars\b \blie\b \blied\b lol\b \blying malpractice molest\b molested molestation molesting murder\b murdered\b murdering\b mutant neglect neglected negligent \bnigger paedophile parole pedophile psychiatric \bpedo\b \bpsycho\b \bpussy\b \bracist \brape\b \braped\b \braping \bscandal\b sexual assault sexually assault \bshit \bslut\b \bsluts\b \bslutty\b \bsteal \bstole\b \bstupid\b \bretarded\b \bretard\b \btheft\b \btits\b \btwat\b \bwanker your mom \bcharged\b \bsentenced\b in jail\b \btraitor

To-do

 * case sensitive bad words (e.g., "dick")
 * add column in the database table for page creator text
 * add all code to code repo
 * prefix database names properly
 * version views columns (e.g., nov_09_views)
 * test.db includes non-articles
 * need to add "reference_headers" count
 * differentiate bad words vs. very bad words
 * track Google hits?
 * track incoming page links from other articles?
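The "version views columns" item could be handled along these lines. This is only a sketch: the helper names and the page_props table used in the usage note are assumptions, and only the nov_09_views naming pattern comes from the list above. SQLite cannot bind identifiers as query parameters, so the column name is built in code from a fixed month table rather than taken from user input.

```python
import sqlite3

MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")


def views_column(year, month):
    """Build a versioned column name such as nov_09_views."""
    return "%s_%02d_views" % (MONTHS[month - 1], year % 100)


def add_views_column(conn, table, year, month):
    """Add the versioned views column to `table` unless it already exists."""
    column = views_column(year, month)
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    existing = [row[1] for row in conn.execute("PRAGMA table_info(%s)" % table)]
    if column not in existing:
        conn.execute("ALTER TABLE %s ADD COLUMN %s INTEGER" % (table, column))
    return column
```

Calling add_views_column(conn, "page_props", 2009, 11) twice adds nov_09_views only once, so the script can be rerun safely for each month's view data.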