User:Crispy1989/CluebotNG Metrics

The new Cluebot works by calculating various statistics on each edit and feeding them into a neural network. To handle edits in real time, these statistics have to be calculated extremely quickly. General-purpose tools such as regular expressions are too slow for this, so specialized algorithms are required. Because of this, some of the statistics (called metrics) need to be hard-coded into the program (in C/C++). Others, however, can be configured at run time.

The run-time configuration is done with XML files, most of which contain specifications for "metrics". A metric is often just a single statistic, but is sometimes the sum of multiple statistics. See the following pages for information on configurable metrics. Any help in adding to these is appreciated.

The files are in XML format, so to view them properly, click on edit. They contain XML comments explaining what the entries mean.

If you add or change entries, please leave an explanation of your additions or changes in the page discussion and, if the purpose is not obvious, add an XML comment before the line(s) you add explaining it.

The general format of the XML file is a list of metrics (each of which has a name), where each metric contains a list of things to search for. The occurrence counts are added together to form the metric. If you are unsure of the format, contact me.
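As a rough illustration of the format described above, the following Python sketch parses a made-up metric list and sums the occurrence counts of each metric's search strings. The element and attribute names (`metrics`, `metric`, `name`, `search`) and the sample metrics are assumptions for illustration only. The real files define their own tags, and the bot itself does this work in C/C++ rather than Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: the tag and attribute names below are illustrative,
# not the bot's real schema.
SAMPLE_XML = """
<metrics>
  <!-- Runs of repeated punctuation -->
  <metric name="repeated_punctuation">
    <search>!!!</search>
    <search>???</search>
  </metric>
  <metric name="exclamation_marks">
    <search>!</search>
  </metric>
</metrics>
"""

def load_metrics(xml_text):
    """Return {metric name: [search strings]} from the hypothetical layout."""
    root = ET.fromstring(xml_text)
    return {
        metric.get("name"): [s.text for s in metric.findall("search")]
        for metric in root.findall("metric")
    }

def compute_metrics(metrics, text):
    """Sum the non-overlapping occurrence counts of each metric's search strings."""
    return {
        name: sum(text.count(needle) for needle in needles)
        for name, needles in metrics.items()
    }

metrics = load_metrics(SAMPLE_XML)
scores = compute_metrics(metrics, "Wow!!! Really??? Yes!")
```

Here `scores` maps each metric name to one number, which is the kind of value that would then be passed on as a single input statistic.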


 * Basic Metrics
 * These are metrics applied to the BEFORE and AFTER (PREVIOUS and CURRENT) revisions. They include strings of punctuation to search for (punctuation is stripped out before diffing, so punctuation searches must go here) and certain characters to count.
 * Text Metrics
 * This contains phrases to search for that contain spaces. Single words to search for should go in the word groups.
 * Edit Comment Metrics
 * This contains phrases to search for in edit comments. This is mostly useful for detecting automatically generated edit comments.
 * Word Groups
 * This contains groups of words to search for. Around 50 different word groups, each containing related words, would be ideal. Having too many groups is not good.
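
To make the word-group idea concrete, here is a minimal Python sketch that counts how many words added by an edit fall into each group. The group names, the word lists, the tokenizer, and the multiset difference used as a stand-in for a diff are all assumptions for illustration. The bot's actual diffing and matching are implemented in C/C++ and may work differently.

```python
import string
from collections import Counter

# Hypothetical word groups; the real groups live in the XML configuration.
WORD_GROUPS = {
    "greetings": {"hi", "hello", "hey"},
    "superlatives": {"best", "greatest", "coolest"},
}

# Translation table that deletes ASCII punctuation, mirroring the note that
# punctuation is stripped out before diffing.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)

def words(text):
    """Lowercase the text, drop punctuation, and split on whitespace."""
    return text.lower().translate(_STRIP_PUNCTUATION).split()

def added_words(previous, current):
    """Words that occur more often in the current revision than in the previous one."""
    return Counter(words(current)) - Counter(words(previous))

def word_group_counts(previous, current, groups=WORD_GROUPS):
    """Count, per group, how many added words belong to that group."""
    added = added_words(previous, current)
    return {
        name: sum(count for word, count in added.items() if word in members)
        for name, members in groups.items()
    }

counts = word_group_counts(
    "The river is long.",
    "The river is long. Hello hello, it is the best river!",
)
```

Each group yields one number per edit, which is one reason to keep the number of groups moderate: every extra group is another input statistic.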