User:UnitBot

UnitBot is a bot to fix the false precision introduced by overzealous unit conversions. It is (as of April 2012) still being coded, but now possesses core functionality.

The problem of overzealous unit conversion
A vast number of Wikipedia articles are blighted by unit conversions which are quoted to a degree of precision far greater than that of the original quantity.

To give one example, 1000 feet is formally equal to 304.8 meters. In many articles (including Trans National Place, Dan Osman, Laurel Creek Gorge Bridge, Newburgh-Beacon Bridge, Ceiling projector, Panama Canal expansion project and Altitudinal zonation) this precise conversion has at some time been used. To give a distance as 304.8m, however, implies that the value is highly precise and must lie between 304.75 and 304.85m. For this to be correct, the original figure of 1000ft would have to be accurate to within 2 inches, which is not true for any of the above cases. Therefore overprecise conversions are not just bad style, but a subtle form of error.
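
The arithmetic behind this example can be checked with a short sketch. It is written in Python purely for illustration (UnitBot itself is PHP), and the function name `implied_interval` is my own invention, not part of the bot. It assumes that a written value implies an uncertainty of half a unit in its last digit.

```python
from decimal import Decimal

INCHES_PER_METER = Decimal(1) / Decimal("0.0254")  # the inch is exactly 0.0254 m

def implied_interval(text):
    """Return the (low, high) range implied by the last written digit."""
    value = Decimal(text)
    # as_tuple().exponent is the power of ten of the last digit, e.g. -1 for "304.8"
    half_step = Decimal(1).scaleb(value.as_tuple().exponent) / 2
    return value - half_step, value + half_step

low, high = implied_interval("304.8")
print(low, high)  # 304.75 304.85

# Half-width of the interval expressed in inches: a little under 2
print((high - low) / 2 * INCHES_PER_METER)
```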

A casual search will reveal similar errors for almost every unit, including those of distance, area, pressure, temperature and more.

False positives
UnitBot may very occasionally make a mistake. See the false positive page for more information.

Stage 1: Avoiding pages in which precise conversions are justified
Some pages contain precise definitions, in which a disparity between the number of significant figures is perfectly justified, for example "1 foot is equal to 30.48 cm". Therefore, the following types of page are off-limits:
 * Pages in categories of the form "Units of x".
 * Pages making any explicit mention of unit conversions, significant figures or false precision (or linking to such pages) as they may contain worked examples.

Defined quantities can cause problems, as they may be a round number in one unit, and yet precise enough to justify multiple significant figures in another unit. Pages likely to have such quantities are to be avoided. These include:
 * Pages about guns and ammunition, due to specifications of calibre.
 * Pages about sports pitches, athletic tracks, and similar, due to sporting standards.  [ Not yet implemented ] 

Some pages are best left alone generally:
 * Disambiguation pages.
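
A page prefilter along these lines might look roughly like the sketch below. It is Python for illustration only and is not the bot's "page_prefilter" class. The keyword lists, the category test for disambiguation pages and the name `is_off_limits` are all assumptions made for the example.

```python
import re

# Hypothetical keyword lists, chosen only to mirror the rules described above.
SENSITIVE_PHRASES = ("unit conversion", "significant figure", "false precision")
DEFINED_QUANTITY_HINTS = ("calibre", "caliber", "ammunition")

def is_off_limits(categories, page_text):
    """Return True if the page should be skipped entirely."""
    text = page_text.lower()
    # Categories of the form "Units of x"
    if any(re.match(r"units of \S", c.strip(), re.IGNORECASE) for c in categories):
        return True
    # Disambiguation pages (assumed here to be recognisable from a category name)
    if any("disambiguation" in c.lower() for c in categories):
        return True
    # Explicit mention of conversions, significant figures or false precision
    if any(phrase in text for phrase in SENSITIVE_PHRASES):
        return True
    # Pages likely to contain defined quantities, such as calibres
    return any(hint in text for hint in DEFINED_QUANTITY_HINTS)

print(is_off_limits(["Units of length"], "The foot is a unit of length."))
print(is_off_limits(["Bridges"], "The bridge is 1000 feet (304.8 m) long."))
```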

Stage 2: Finding unit conversions
Strings which possibly represent some number and unit combination are found. The numerical values are parsed, and the number of significant figures these values are given to is calculated. To avoid irrelevant fragments of text being falsely interpreted as units:
 * Strings are rejected if surrounded by improbable characters (e.g. anything other than spaces, commas, parentheses and similar).
 * Units are rejected if hyperlinked, indicating that some importance is attached to the unit. (This feature is implicit in the above restriction on surrounding characters.)
 * Numbers are rejected if they cannot be easily parsed. For this purpose, custom number parsing functions are used, which reject any malformed number.  (By comparison, built in PHP functions, such as floatval and intval, are designed to be accommodating of malformed numbers.)
 * Numbers are rejected if leading zeros are present, which indicate that the sequence probably isn't a numerical value.
 * Strings are rejected if within structures such as blockquotes, reference tags or transclusion brackets.  [ Not yet implemented ] 
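
To illustrate strict number parsing and significant-figure counting, here is a minimal Python sketch. It is not the bot's own parsing code: the names `parse_strict` and `significant_figures` are made up for the example, and it assumes that trailing zeros of a whole number (as in "1000") are not significant.

```python
import re

STRICT_NUMBER = re.compile(
    r"(?:0|[1-9]\d*|[1-9]\d{0,2}(?:,\d{3})+)"  # whole part, optionally comma-grouped
    r"(?:\.\d+)?"                               # optional decimal part
)

def parse_strict(text):
    """Return the value as a float, or None if the text is not a clean number."""
    if STRICT_NUMBER.fullmatch(text) is None:
        return None  # leading zeros, stray letters, bad grouping and so on
    return float(text.replace(",", ""))

def significant_figures(text):
    """Count significant figures in an already validated number string."""
    digits = text.replace(",", "")
    if "." in digits:
        # Leading zeros are placeholders; trailing decimal zeros are significant
        count = len(digits.replace(".", "").lstrip("0"))
    else:
        # Assumption: trailing zeros of a whole number are not significant
        count = len(digits.rstrip("0"))
    return max(count, 1)

for sample in ("1,000", "304.8", "007", "12abc"):
    print(sample, parse_strict(sample))
```

Under that assumption "1000" counts as 1 significant figure and "304.8" as 4, which is the disparity the bot is looking for.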

Pairs of number and unit combinations are then linked together into unit conversions. To further reduce the chance of irrelevant fragments of text being interpreted as unit conversions:
 * Pairs of units are rejected if dimensionally inconsistent. (This feature is implicit, as quantities of different dimension are treated separately.)
 * Pairs are rejected if the quantity and the converted quantity don't match to the degree implied by the number of significant figures.
 * Pairs are rejected if the units are too far apart in the page.
 * Pairs are rejected if the units are separated by any form of page formatting feature, such as titles, new paragraphs, or changes in text styles.
 * Pairs are rejected if other units are present in between.  [Requires upgrading to accommodate number sets, e.g. 10cm x 15cm (4in x 6in) ]
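
The dimensional and numerical checks could be sketched as follows. This is an illustrative Python example, not the "find_conversions" class: the small factor table and the names `half_step` and `is_plausible_pair` are assumptions, although the factors shown are the exact defined values. Numbers are assumed to have had any grouping commas removed already.

```python
from decimal import Decimal

# Factors to a base unit, grouped by dimension so that units of
# different dimension can never be paired with one another.
FACTORS = {
    "length": {"m": Decimal("1"), "cm": Decimal("0.01"),
               "ft": Decimal("0.3048"), "in": Decimal("0.0254")},
    "mass": {"kg": Decimal("1"), "lb": Decimal("0.45359237")},
}

def half_step(text):
    """Half a unit in the last significant digit of a written number."""
    value = Decimal(text)
    if "." not in text:
        # Assumption: trailing zeros of a whole number are not significant
        value = value.normalize()
    return Decimal(1).scaleb(value.as_tuple().exponent) / 2

def is_plausible_pair(src_text, src_unit, dst_text, dst_unit):
    """True if dst is a conversion of src to within its own written precision."""
    for units in FACTORS.values():
        if src_unit in units and dst_unit in units:
            expected = Decimal(src_text) * units[src_unit] / units[dst_unit]
            return abs(expected - Decimal(dst_text)) <= half_step(dst_text)
    return False  # units missing from the table or dimensionally inconsistent

print(is_plausible_pair("1000", "ft", "304.8", "m"))  # True
print(is_plausible_pair("10", "kg", "33", "ft"))      # False
```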

Stage 3: Assessing unit conversions
Another filtering stage (this time based around individual units, rather than the entire page) is used to reject potential false positives:
 * Corrections are avoided if a number to be converted is very similar, but not equal, to another number, since there may be a critical difference between the two numbers which rounding would disrupt.  [ Not yet implemented ] 
 * Corrections are avoided if the conversion is in close proximity to words such as "defined", "equal", "standard" or "exactly", indicating some kind of definition or standard.
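
The keyword check might be sketched as below, in Python for illustration. The window of 60 characters and the name `near_definition_word` are assumptions; the word list is the one given above, matched as a prefix so that "equals" and "standards" are also caught.

```python
import re

DEFINITION_WORDS = re.compile(r"\b(?:defined|equal|standard|exactly)", re.IGNORECASE)

def near_definition_word(text, start, end, window=60):
    """True if a definition-like word occurs within `window` characters of the span."""
    context = text[max(0, start - window):end + window]
    return DEFINITION_WORDS.search(context) is not None

definition = "One foot is defined as exactly 30.48 cm."
start = definition.index("30.48")
print(near_definition_word(definition, start, start + 5))  # True
```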

A two level system is then used to determine whether a correction is required:
 * Corrections will be made if the conversion shows a large disparity in the number of significant figures (for example 50 meters = 164.041995 feet).
 * Corrections will also be made for conversions with smaller disparities (for example 90 meters = 295 feet), if they are in combination with adjectives indicating inherent imprecision, such as "nearly", "over", or "approximately".  [ Not yet implemented ] 
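
A two level test of this kind could be sketched as follows (Python, for illustration only). The thresholds `large_gap` and `small_gap` are hypothetical values picked for the example, as this page does not define what counts as a "large" disparity, and `needs_correction` is not a function of the bot.

```python
import re

HEDGE_WORDS = {"nearly", "over", "approximately"}

def needs_correction(src_figures, dst_figures, context, large_gap=4, small_gap=2):
    """Decide whether a conversion is overprecise enough to be corrected."""
    gap = dst_figures - src_figures
    if gap >= large_gap:
        return True  # level 1: large disparity, always corrected
    # Level 2: a smaller disparity counts only next to a word implying imprecision
    words = set(re.findall(r"[a-z]+", context.lower()))
    return gap >= small_gap and bool(words & HEDGE_WORDS)

print(needs_correction(1, 9, "the pool is"))       # True
print(needs_correction(1, 3, "over"))              # True
print(needs_correction(1, 3, "the overall span"))  # False
```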

If the conditions for rounding are met, a more sensible conversion will be decided upon:
 * Ordinarily, the conversion will be rounded to the same number of significant figures as the original number.
 * However, if the rounding acts to introduce an error of 5% or more, the number of permitted significant figures will be incremented.
 * If the rounding only acts to remove 1 significant figure, then the correction will be abandoned, as it can look somewhat pedantic. However, the correction will still be made if it acts to remove a decimal point. For example, "over 500 feet (152 meters) high" will not be rounded off to 150 meters, despite this correction being perfectly justified. However, "over 50 feet (15.2 meters) high" will be rounded off to 15 meters, as the decimal point is completely superfluous.
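
The three rounding rules can be combined in a short sketch. This is a Python illustration under stated assumptions rather than the "fix_conversions" class: the function names are invented, half-up rounding is assumed, and trailing zeros of a whole number are assumed not to be significant.

```python
from decimal import Decimal, ROUND_HALF_UP

def sig_figs(text):
    """Significant figures of a clean number string (whole-number zeros ignored)."""
    digits = text.replace(",", "")
    if "." in digits:
        return max(len(digits.replace(".", "").lstrip("0")), 1)
    return max(len(digits.rstrip("0")), 1)

def round_to_sig_figs(value, figures):
    """Round a non-zero Decimal to the given number of significant figures."""
    exponent = value.adjusted() - figures + 1
    return value.quantize(Decimal(1).scaleb(exponent), rounding=ROUND_HALF_UP)

def suggest_rounding(src_text, dst_text):
    """Return the rounded conversion as text, or None if no edit should be made."""
    dst = Decimal(dst_text.replace(",", ""))
    dst_figures = sig_figs(dst_text)
    figures = sig_figs(src_text)  # rule 1: match the original number
    rounded = round_to_sig_figs(dst, figures)
    # Rule 2: allow another figure while rounding shifts the value by 5% or more
    while figures < dst_figures and abs(rounded - dst) / dst >= Decimal("0.05"):
        figures += 1
        rounded = round_to_sig_figs(dst, figures)
    new_text = format(rounded, "f")
    removed = dst_figures - figures
    drops_point = "." in dst_text and "." not in new_text
    # Rule 3: removing a single figure is too pedantic unless it drops the point
    if removed <= 0 or (removed == 1 and not drops_point):
        return None
    return new_text

print(suggest_rounding("500", "152"))   # None
print(suggest_rounding("50", "15.2"))   # 15
```

With these assumptions the two examples above behave as described: 152 meters is left alone, while 15.2 meters becomes 15.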

A list of recommended text replacements is then assembled:
 * To maintain consistency, idiosyncrasies in the page formatting are retained. Whether or not a space is present between the number and the unit, and whether or not digits are grouped using commas, will depend on the original text.
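
Retaining the original formatting might work roughly like this Python sketch. The helper `reformat_like` is hypothetical: it copies the spacing and comma grouping of the original text onto the new number.

```python
import re

def reformat_like(original, new_number):
    """Write new_number with the spacing and digit grouping used in original."""
    match = re.fullmatch(r"([\d,]*\d(?:\.\d+)?)(\s*)(.+)", original)
    if match is None:
        return None  # not a recognisable "number unit" string
    number, gap, unit = match.groups()
    if "," in number:
        # The original grouped its digits, so group the replacement too
        whole, point, fraction = new_number.partition(".")
        new_number = f"{int(whole):,}" + point + fraction
    return new_number + gap + unit

print(reformat_like("304.8m", "300"))        # 300m
print(reformat_like("1,000.0 ft", "1000"))   # 1,000 ft
```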

Stage 4: Editing the page
The text replacements are applied to the page string, and this is uploaded to Wikipedia.
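
One simple way to apply a set of replacements to the page string is sketched below in Python. It assumes each replacement is stored as a (start, end, new text) tuple with non-overlapping spans, which is a guess at the data layout rather than a description of the bot. The upload step is not shown.

```python
def apply_replacements(page, replacements):
    """Apply (start, end, new_text) edits, working backwards through the page."""
    # Going from the end of the page keeps the earlier offsets valid
    for start, end, new_text in sorted(replacements, reverse=True):
        page = page[:start] + new_text + page[end:]
    return page

page = "over 50 feet (15.2 meters) high and 1000 ft (304.8 m) long"
first = page.index("15.2")
second = page.index("304.8")
edits = [(first, first + 4, "15"), (second, second + 5, "300")]
print(apply_replacements(page, edits))
```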

Implementation
UnitBot is written in PHP, and uses the botclasses.php class to communicate with the MediaWiki API. Stage 1 of the correction algorithm is implemented by the class "page_prefilter", stage 2 by the class "find_conversions", and stage 3 by the class "fix_conversions". Stage 4 is implemented in the main script. The source code will be made available when the bot is closer to completion.

At present, UnitBot possesses the core ability to find overprecise unit conversions, and correct them. It currently lacks any ability to automatically cycle through pages.

Would you like to help?
I would appreciate help with the development of UnitBot. All PHP programmers are welcome to participate. In addition, non-technical help would be appreciated in:


 * Assembling comprehensive lists of exact unit conversion factors.
 * Suggesting special cases which may trip up the above algorithm.
 * Testing.

Blocking UnitBot
Inserting the command:

into a page will instruct UnitBot to not edit that page.

Bot approval
The bot is currently approved for a trial of human supervised test edits.

Subpages

 * Information page about false positives.
 * Template for flagging pages with bad unit conversions.
 * Inline template for flagging bad unit conversions.