User:Shadowjams/AWB

The following are regular expressions I use with AWB to fix some common errors. Most of these are Manual of Style edits and should, by definition, be non-controversial.

Feel free to use whatever I post here in your own program or browsing. Any feedback or improvements are welcome. Please be careful with the expressions marked unvetted, and even with those I have tested: any of them might occasionally produce false positives.

Most of these should not be incorporated into the AWB typo project. Some have a level of false positives that is acceptable if you know they're there, but not if they're part of the anonymous typo project. Others are too CPU-intensive to justify their inclusion.

Tested expressions
Consider these expressions "tested", sort of like a beta. I've used them for some time and they should work well. I'll try to point out any common false positives I remember.

American format
This will unlink dates in the American format, removing the link brackets and keeping the date text. It is tolerant of variations in comma usage, spacing, month abbreviations, and the use of ordinal suffixes, like "3rd" or "5th", as well as different permutations of brackets.

Find: \[\[ *(January|February|March|April|May|June?|July?|August|September|Oct\.?(ober)?|November|Dec\.?(ember)?)( *[0-3]?[0-9])?(st|rd|nd|th)? *(\]\])?( ?st| ?rd| ?nd| ?th)?( *,? *)?(\[\[ *)?([0-9]{3,4}) *\]\]

Settings: None

Replace: $1$4$8$10

Issues: I've used this for quite a while and never had a false positive; in other words, what it matches is always a date. There are some pages where date linking is acceptable, though, so be careful and exclude those classes of articles if you can. Also leave alone dates that are notable in their own right or relevant in context, for example December 7, 1941.
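
To try the expression outside AWB, it can be run through Python's re module. This is only a rough sketch: the function name unlink_american_dates is made up for illustration, and AWB's $n replacement syntax is rewritten as Python's \g<n>, on the assumption that the .NET and Python engines treat this particular pattern the same way.

```python
import re

# The Find expression from above, split across lines only for readability.
AMERICAN_DATE = re.compile(
    r"\[\[ *(January|February|March|April|May|June?|July?|August|September"
    r"|Oct\.?(ober)?|November|Dec\.?(ember)?)( *[0-3]?[0-9])?(st|rd|nd|th)?"
    r" *(\]\])?( ?st| ?rd| ?nd| ?th)?( *,? *)?(\[\[ *)?([0-9]{3,4}) *\]\]"
)

# AWB's "$1$4$8$10": month, day, comma and spacing, year.
# Groups that did not take part in the match are replaced by nothing.
REPLACEMENT = r"\g<1>\g<4>\g<8>\g<10>"


def unlink_american_dates(wikitext):
    """Return wikitext with American-format date links unlinked (hypothetical helper)."""
    return AMERICAN_DATE.sub(REPLACEMENT, wikitext)


print(unlink_american_dates("Attacked on [[December 7]], [[1941]]."))
# prints: Attacked on December 7, 1941.
```

Note that the ordinal suffix groups are not part of the replacement, so "[[Oct. 3rd, 2005]]" comes out as "Oct. 3, 2005".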

British format
This will unlink dates in the British format. It's less developed than the one above, but works on the same principles. You could probably reconfigure the above for a more robust version, but this works in most cases.

Find: \[\[0?([1-3]?[0-9])(st|rd|nd|th)?[ _]*(January|February|March|April|May|June|July|August|September|October|November|December)\]\] *\[\[([1-9][0-9]{2,})\]\]

Replace: $1 $3 $4

Issues: Same as above.
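
A similar sketch for the British expression, again assuming Python's re module behaves like AWB's engine for this pattern; unlink_british_dates is a hypothetical name used only here.

```python
import re

# The Find expression from above, split across lines only for readability.
BRITISH_DATE = re.compile(
    r"\[\[0?([1-3]?[0-9])(st|rd|nd|th)?[ _]*"
    r"(January|February|March|April|May|June|July|August|September"
    r"|October|November|December)\]\] *\[\[([1-9][0-9]{2,})\]\]"
)


def unlink_british_dates(wikitext):
    """Apply the rule's "$1 $3 $4" replacement: day, month, year (hypothetical helper)."""
    return BRITISH_DATE.sub(r"\g<1> \g<3> \g<4>", wikitext)


print(unlink_british_dates("Attacked on [[7 December]] [[1941]]."))
# prints: Attacked on 7 December 1941.
```

The expression expects the day-month link and the year link to be separate, so a single link such as "[[7 December 1941]]" is left untouched.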

Punctuation around references
This finds punctuation placed after a footnote and moves it in front of the footnote. For example:
 * This sentence has punctuation placed improperly[1].

Corrects to
 * This sentence has punctuation placed properly.[1]

Find: [\.;,:\?]? *((<ref[^>]*>[^<]+</ref[^>]*>)|(<ref[^>]* / *>)) *([\.,;:\?])

Settings: Multiline

Replace: $4$1

Issues: Each pass moves the punctuation mark past only one reference, so long strings of consecutive references with improper punctuation will only have the mark moved one step up the chain. It should do this twice without issue, but with three or more references the punctuation will end up between two of them. It would be possible to put longer chains of <ref> tags into the expression to handle more, but that would make it cumbersomely long. As it stands, this corrects the vast majority of problems.
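
The chain behaviour described above can be reproduced with a short Python sketch. The pattern below is written out in full with its <ref tag openers, which is an assumption about the intended expression; the move_punctuation helper and its passes parameter are hypothetical, standing in for however many times the rule is applied.

```python
import re

# Assumed full form of the Find expression: a paired <ref>...</ref> or a
# self-closing <ref ... />, followed by a stray punctuation mark.
REF_THEN_PUNCT = re.compile(
    r"[\.;,:\?]? *((<ref[^>]*>[^<]+</ref[^>]*>)|(<ref[^>]* / *>)) *([\.,;:\?])"
)


def move_punctuation(wikitext, passes=1):
    """Apply the "$4$1" replacement (punctuation, then reference) `passes` times."""
    for _ in range(passes):
        wikitext = REF_THEN_PUNCT.sub(r"\g<4>\g<1>", wikitext)
    return wikitext


one_ref = "placed improperly<ref>Source</ref>."
three_refs = "text<ref>a</ref><ref>b</ref><ref>c</ref>."

print(move_punctuation(one_ref))
# prints: placed improperly.<ref>Source</ref>
print(move_punctuation(three_refs, passes=2))
# prints: text<ref>a</ref>.<ref>b</ref><ref>c</ref>
```

With three consecutive references, two passes leave the full stop stranded after the first reference, which is the limitation noted under Issues.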

Adding convert templates to measurements
This adds convert templates around common measurements. For example:
 * X is 45 mi from Y.

Corrects to
 * X is {{convert|45|mi}} from Y.

Find: \b([0-9][0-9\.]*),?([0-9]{3})?( |&nbsp;)*(mph|km\/h|mi|ml|gal|km|acres|ft|cm|km²)([ \.,:])

Settings: Multiline

Replace: {{convert|$1$2|$4}}$5

Issues: Many articles already have the conversion inserted manually. These should usually be left alone, although manual conversions are sometimes inaccurate and replacing them is better. The expression will not detect existing conversions on its own, so you must make sure you're not duplicating one. The other major issue is measurements inside links, and names that should not be converted. Firearms and military-related articles are particularly bad in this regard. For example, don't change 9 mm to {{convert|9|mm}} if the text is discussing the firearm round. Other unit codes that work in the convert template can be added to the expression, but some (such as m for meter) will generate a lot of false positives. Also, convert won't handle full names, like mile or kilometer, so those aren't fixed by the expression.
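
The duplication problem is easy to see in a small Python sketch. Two things here are assumptions rather than part of the rule as posted: the non-breaking space is written as &nbsp;, and the replacement is taken to be {{convert|$1$2|$4}}$5. The add_convert name is hypothetical.

```python
import re

# Find expression with the non-breaking space assumed to be written as &nbsp;.
MEASUREMENT = re.compile(
    r"\b([0-9][0-9\.]*),?([0-9]{3})?( |&nbsp;)*"
    r"(mph|km\/h|mi|ml|gal|km|acres|ft|cm|km²)([ \.,:])"
)

# Assumed replacement: number (thousands comma dropped), unit, trailing character.
REPLACEMENT = r"{{convert|\g<1>\g<2>|\g<4>}}\g<5>"


def add_convert(wikitext):
    """Wrap bare measurements in a convert template (hypothetical helper)."""
    return MEASUREMENT.sub(REPLACEMENT, wikitext)


print(add_convert("X is 45 mi from Y."))
# prints: X is {{convert|45|mi}} from Y.
print(add_convert("X is 45 mi (72 km) from Y."))
# prints: X is {{convert|45|mi}} (72 km) from Y.
```

The second call shows why every change needs review: the manual "(72 km)" is not recognised, so the article would end up with the conversion twice.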

Alpha (unvetted) expressions
If you use these, make sure you carefully review every change, because I'm not confident that they won't produce quite a few false positives.

Removing external links from disambiguation pages
Find: \[?https?://[^\] ]+ *([^\]]*)\]?

Replace: $1

Limit to: \{\{d(ab|isambig(uation)?)\}\}
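
A rough Python sketch of this rule follows. It assumes "Limit to" means the replacement is applied only when the page text contains a match for that expression; strip_external_links is a hypothetical name.

```python
import re

EXTERNAL_LINK = re.compile(r"\[?https?://[^\] ]+ *([^\]]*)\]?")
DAB_TEMPLATE = re.compile(r"\{\{d(ab|isambig(uation)?)\}\}")


def strip_external_links(wikitext):
    """Replace external links with their label, on disambiguation pages only."""
    # Assumed meaning of "Limit to": skip pages without a disambiguation template.
    if not DAB_TEMPLATE.search(wikitext):
        return wikitext
    return EXTERNAL_LINK.sub(r"\g<1>", wikitext)


page = "* [http://example.com Example Corp], a company\n{{disambig}}"
print(strip_external_links(page))
# prints:
# * Example Corp, a company
# {{disambig}}
```

One case to watch for when reviewing: a bare URL followed later by a wikilink, as in "http://example.com see [[Foo]]", comes out as "see [[Foo]" because the optional closing bracket at the end of the expression consumes one of the wikilink's brackets.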

Removing references from disambiguation pages
Find: (< *ref[^>]*>[^<]+?< */ *ref[^>]*>)|(<ref[^>/]+/ *>)

Replace: (empty string)

Limit to: \{\{d(ab|isambig(uation)?)\}\}
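
The same kind of sketch for removing references. The pattern is an assumed reading of the expression: the self-closing branch is written with its <ref opener, and the quantifiers after ref are written as * so that bare <ref> and </ref> tags match. As before, "Limit to" is taken to mean the page must contain a disambiguation template, and strip_references is a hypothetical name.

```python
import re

# Assumed form of the Find expression (see the caveats in the text above).
REFERENCE = re.compile(r"(< *ref[^>]*>[^<]+?< */ *ref[^>]*>)|(<ref[^>/]+/ *>)")
DAB_TEMPLATE = re.compile(r"\{\{d(ab|isambig(uation)?)\}\}")


def strip_references(wikitext):
    """Delete references (empty replacement), on disambiguation pages only."""
    if not DAB_TEMPLATE.search(wikitext):
        return wikitext
    return REFERENCE.sub("", wikitext)


page = '* Foo, a thing<ref>Some source</ref>\n* Bar<ref name="a" />\n{{disambig}}'
print(strip_references(page))
# prints:
# * Foo, a thing
# * Bar
# {{disambig}}
```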