User:John of Reading/Typo fixing with AutoWikiBrowser

If anyone's interested, here is how I do bulk typo fixing:

Creating the list
I download and uncompress an 80GB database dump every couple of months, and use the AWB database scanner to create my main work lists. This is the only method I know that:
 * Allows precise control over the search using regular expressions
 * Returns the full list of results
 * Does not try to guess what I really meant

I am happy to be asked to create article lists for other editors. My latest download is [ watch for a new download].

Using the Database Scanner
I use a regular expression that describes several typos at once, so that I get a long list of articles that need a variety of fixes.
 * I like to search for a prefix rather than a whole word, so that I find occurrences where editors have mangled the ending of a word as well as the beginning.
 * If the list is going to be a long one, I'll run the first 1 or 2 percent of the scan and see if I can tweak the regular expression to eliminate some of the false positives.
 * When working on the Lists of common misspellings, I use a regular expression that searches for all the words in an entire list. For example,  from the "E" list.
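A minimal sketch of the prefix idea, written in Python rather than the .NET regular expressions that AWB uses. The three prefixes are hypothetical and chosen only for illustration; they are not the actual search targets.

```python
import re

# Hypothetical misspelled prefixes, for illustration only.
PREFIXES = ["accomodat", "occurence", "seperat"]

# One alternation covers every prefix. The leading \b anchors the start
# of the word; there is no trailing \b, so any word ending is accepted.
SCAN_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, PREFIXES)) + r")")

# Same pattern extended with \w* so that the whole word is captured.
WORD_RE = re.compile(SCAN_RE.pattern + r"\w*")


def needs_attention(page_text: str) -> bool:
    """Return True if the page contains any of the target prefixes."""
    return SCAN_RE.search(page_text) is not None


def matched_words(page_text: str) -> list[str]:
    """Return each full word that starts with a target prefix."""
    return WORD_RE.findall(page_text)
```

Because only the start of each word is fixed, one pattern finds "seperated", "seperating" and any other mangled ending in a single pass.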

To use the AWB database scanner, select "Database dump" from the drop-down control just above the article list, then press "Make list". Within the scanner dialog, I choose these options before clicking "Start":
 * On the "Database" tab, I browse to my download folder and select the uncompressed database dump.
 * On the "Namespace" tab, I tick the first "Content" checkbox, which ticks everything in the first list box, then untick "Draft". These database dumps do not include any user pages, so that tick is not relevant.
 * On the "Title" tab, I skip blocks of titles within the Wikipedia namespace. I tick "Not contains", "Regex" and "Case sensitive", and paste the long regular expression, below, into the second text box.
 * On the "Text" tab, I tick "Contains", "Regex" and "Ignore comments", and paste my current search targets into the first text box.

(?:ARTICLES|Charles Magauran|Commonly misspelled English words|Cut Spelling|Date and time notation in the United Kingdom|Drexel\s+4\d\d\d|Early Cornish texts|English orthography|Henry Marshall Furman|Interspel|List of On Cinema episodes|List of the Dead Daisies members|Nairai\b|Otte Rud|SoundSpel|Transposed letter effect|OTHERS|Abuse reports|Abuse response/|Academic studies of Wikipedia|ACF Regionals answers/|Administrators' noticeboard|AMA IRC Meeting log|Adopt-a-typo|Arbitration Committee Elections|Arbitration/|Archived deletion|articles by quality log|Articles for|Articles with UK Geocodes|Attached KML/List of power stations in New Zealand|AutoWikiBrowser/Typos|BillboardEncode/|BillboardID/|Categories for|Catholic Encyclopedia topics/|Centralized discussion/|Changing username/|CHECKWIKI/|Contributor copyright investigations/|Copyright problems/|Correct typos in one click|Coverage of Mathworld topics/|Database reports/|Deleted articles with freaky titles|Deletion log/|Deletion log archive|Deletion review|Did you know nominations/|Disambiguation pages with links/|Editor review/|Featured article|Featured list|Featured picture|Featured portal|Featured topic|Files for|Find a Grave famous people/|GLAM/NHMandSM|GLAM/Your paintings|Goings-on/|Good article reassessment|In the news/|India Education Program/Courses/|Jewish Encyclopedia topics/|Jimbo Wales discussion|List of encyclopedia topics/|List of Wikipedians by|Lists of common misspellings|Main Page history/|Mediation Cabal/|Meetup/|Miscellany for|Move review/|New user log/|Pfam2pdb|Pfam2PDBsum|Picture peer review|Possibly unfree|Recent additions|Redirects for|Reference desk archive|Requested articles|Requests for|Sandbox/|School and university projects/|Shortcut table/|Sockpuppet investigations/|Stub types for deletion|Suspected copyright violations/|Suspected sock puppets|Templates for|Templates with red links|Tyop Contest|Typo Team|Unwanted Cinema cover.png|Upload log archive|Votes for deletion|Wiki Ed/|Wiki Guides/|Wikipedia Signpost/2|Wikipedia Signpost/Special|WikiProject Academic Journals/|WikiProject Chemicals/Log/|WikiProject Chemistry/IRC|WikiProject Directory/Description|WikiProject Editor Retention/|WikiProject Fix common mistakes/|WikiProject History Merge/|WikiProject Intertranswiki/|WikiProject Languages/|WikiProject London Transport/The Metropolitan/|WikiProject Missing encyclopedic articles/|WikiProject Pharmacology/Log/|WikiProject Red Link Recovery/|WikiProject Short descriptions/wd/|WikiProject Spam/|SLASH|/All discussions|/[Aa]rchive|/Article alerts|/Article list|/Article Talk list|/Articles|/Assessment|/Cleanup listing|/CurrentTranscriptions|/[Dd]ata|/Deletion archive|/Did you know|/Discussions?|/DYK|/Encyclopedic articles|/Example generated lists|/[Ff]eedback|/Fundraising|/ICC valuations|/Internet Relay Chat|/IRC|/List of all portals|/List of biographies|/List of mountains|/Listeria|/Listing by project|/Lists of pages|/Members|/Metrics/|/Newsletter|/Participants|/Peer review|/Popular pages|/Prospectus|/[Pp]ublicwatchlist|/Recent changes|/Recognized [Cc]ontent|/[Rr]edlinks|/Rename template parameters|/[Ss]andbox|/Settings/|/Stale drafts|/Stats|/Statistics|/Talk|/Translation task force|/Unpatrolled|/Watchall|/[Ww]atchlist)

Yes, there are a few article titles in this list. Some of these contain many false positives, others are where I don't wish to repeat a mistake, others are where I am avoiding a slow-motion edit war.
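The effect of the "Not contains", "Regex" and "Case sensitive" options on the "Title" tab can be sketched in Python as follows. The pattern is a heavily shortened stand-in built from a few fragments of the long expression, and the sketch assumes the scanner tests the pattern anywhere within the title.

```python
import re

# Shortened stand-in for the long title pattern, for illustration only.
SKIP_TITLE_RE = re.compile(
    r"(?:Cut Spelling|Drexel\s+4\d\d\d|Nairai\b|Articles for|Sandbox/"
    r"|/[Aa]rchive|/[Ss]andbox)"
)


def keep_title(title: str) -> bool:
    """Keep a title only if the pattern matches nowhere in it.

    No IGNORECASE flag is used, which mirrors "Case sensitive".
    """
    return SKIP_TITLE_RE.search(title) is None
```

Note that a fragment such as `Nairai\b` skips the title "Nairai" but still lets a longer title that merely begins with those letters through.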

Settings within AutoWikiBrowser

 * I usually tick "Apply general fixes" so that AWB can make many kinds of automatic corrections to the formatting of the text.
 * I tick "Find & Replace"; and within the configuration dialog:
   * I do not tick the "ignore" checkboxes, so that I force the correction of the current misspelling even in the parts of the page that RegExpTypoFix will skip.
   * I tick "Add replacements to edit summary" so that the edit summary is as helpful as possible. For this to work properly, the "Find" strings must match all four brackets of a link.
   * I usually start with my accumulated list of over 4,000 spelling rules.
   * Below the spelling rules, I start with a dummy "Find & Replace" rule that finds the exact regular expression that I used for the database scan and replaces it with "INVESTIGATE".
 * I tick "Skip if no replacement".
 * I tick "Skip if only minor replacements made"; within the "Find & Replace" dialog I sometimes use the "Minor" checkbox to mark rules that make a change that is valid but, I think, not worth saving by itself.
 * I sometimes tick "Enable RegExpTypoFix" so that AWB will try to fix this enormous list of misspellings and manual of style issues.
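The ordering of the spelling rules and the dummy "INVESTIGATE" rule can be sketched in Python like this. Both the spelling rule and the scan expression are hypothetical; the point is only that the catch-all runs last, so it flags whatever the specific rules did not fix.

```python
import re

# Hypothetical spelling rules as (find, replace) pairs; the real list
# has thousands of entries.
SPELLING_RULES = [
    (r"\bseperat(e[ds]?|ing|ion)\b", r"separat\1"),
]

# Hypothetical scan expression, standing in for whatever was used for
# the database scan.
SCAN_PATTERN = r"\b(?:seperat|accomodat)\w*"


def propose_edit(text: str) -> str:
    """Apply the spelling rules in order, then flag any leftovers."""
    for find, replace in SPELLING_RULES:
        text = re.sub(find, replace, text)
    # Dummy catch-all rule: anything the scan pattern still matches has
    # no spelling rule yet and needs a human decision.
    return re.sub(SCAN_PATTERN, "INVESTIGATE", text)
```

A page where every match is handled by a spelling rule comes out clean, while an unhandled match shows up in the diff as the word "INVESTIGATE".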

May 2023: I'm currently running with General Fixes turned off because this discussion has not reached a conclusion.

May 2023: I gave up on RegExpTypoFix some years ago. I prefer to leave MOS fixes to editors who are prepared to defend them.

Checking each proposed edit
Then it's up to me to check each proposed edit.


 * If the general fixes or RegExpTypoFix changes have made the diff too long to check easily, I'll turn the option off temporarily and try again.
 * If I can't understand what the text is trying to say, I don't try to fix it.
 * If my "Find & Replace" has damaged correct text, then I may pause to think about changing the re-spelling rules to avoid the false positive.
 * If my "Find & Replace" has identified incorrect text by changing it to the word "INVESTIGATE", then I'll either add or adjust a re-spelling rule and try again, or make a one-off edit to the article text.
 * If my "Find & Replace" has made an incorrect fix, I'll either adjust the settings and try again, or make a one-off edit to the article text.
 * If I'm not comfortable with the general fixes or RegExpTypoFix changes I'll either turn the option off temporarily and try again, or double-click the affected paragraphs in the upper panel to undo those changes.
 * If the changes are part of quoted text or something like a book title, I'll jump out to another window to try to check the source.

I may make other edits in the AWB edit box, fixing additional typos or correcting syntax errors that AWB has identified but not fixed automatically.

I pick one of a handful of pre-configured edit summaries, and then modify it if necessary to describe the edits I actually made. I don't always list all the typos I've fixed. I don't bother to say whether I have retained all the general fixes or undone some of them. I don't bother to distinguish the actual general fixes from the MOS fixes made by the RegExpTypoFix rules or my own rules.

I try to remember to clear the "Minor edit" checkbox if I've done anything more than simple typo-fixing or if the diff seems very long; the danger is that I forget to tick it again afterwards.

Then it's "Save" and on to the next article.

There is a danger that I'll accidentally save the word "INVESTIGATE" in an article. I check for this kind of error by running this search every day or two.

Editing quotes, book titles and such like
My regular expressions run on the whole page including quotations, book titles and so on. If I edit these, I try to leave a helpful edit summary:


 * replaced: foo → bar per source
 * I found the source and was able to verify that the version at Wikipedia was incorrect. Either an earlier editor miscopied it, or, perhaps, the source has been corrected after it was copied.


 * replaced: foo → bar per book cover image at Amazon/Abebooks/etc.
 * I found an image of the book with enough pixels for the words to be read clearly.


 * replaced: foo → bar per a search at Amazon/Abebooks/etc.
 * I didn't find a usable cover image, but these external sites support the correction.


 * replaced: foo → bar - MOS:QUOTE recommends fixing "insignificant" errors in quoted text
 * I found the source also has the error, but I've made an editorial decision to apply MOS:QUOTE.


 * replaced: foo → bar - In a quote, but I'm assuming this was a copying error
 * I haven't found the source, but to me it looks like a copying error. Or perhaps MOS:QUOTE might apply.


 * replaced: foo → bar for legibility
 * I didn't bother to check the source, as the change is small and the incorrect version is hard to read - something like WIlliam > William


 * replaced: foo → bar
 * Oh dear. Perhaps I didn't spot that I was about to edit a quote, or I neglected to adjust the edit summary. Please revert if necessary, but it's possible that MOS:QUOTE might apply.

Skipping false positives
The best way to skip false positives is to use regular expressions with lookahead/lookbehind. This method is especially useful when doing the initial database scan, since it means the articles don't even appear in the list.

I've developed a few standard suffixes that arrange for some common false positives to be skipped. I'll tack some of them on to the long regular expression when doing the initial search, and sometimes tack them on to individual find+replace rules when needed.

I'll typically save edits to around 40% of the articles that turn up in my list, so it is important that the other 60% are skipped efficiently.
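As a rough illustration of such suffixes, here is a Python sketch in which negative lookaheads are tacked on to a prefix search. The two exclusions are invented for this example and are not the actual standard suffixes.

```python
import re

# Prefix search, closed with \b so that the lookaheads below are always
# tested at the end of the whole word and never part-way through it.
BASE = r"\bexercic\w*\b"

# Hypothetical suffixes: each is a negative lookahead describing text
# that, when it follows the word, marks a likely false positive.
NOT_FRENCH = r"(?!\s+(?:de|du|des)\b)"  # e.g. a French phrase
NOT_SIC = r"(?!\s*\[sic\])"             # already marked as deliberate

SCAN_RE = re.compile(BASE + NOT_FRENCH + NOT_SIC)


def is_candidate(text: str) -> bool:
    """Return True if the text still matches after the exclusions."""
    return SCAN_RE.search(text) is not None
```

The closing `\b` in `BASE` matters: without it the engine could backtrack to a shorter match such as "exercic", where the lookahead no longer sees the following space, and the exclusion would silently stop working.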

For example, the "E" list says that "exercice" may be a misspelling of "exercise". I actually searched for  so that I found "exerciced", "exercicing" and so on. As I worked through the list I gradually expanded the rule to

Alternatively, I use the "Minor replacement" feature. If I can write a regular expression that describes a set of false positives, I'll add a re-spelling rule to change that to "FALSE" and mark the rule as "minor".
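The interaction between a "FALSE" rule and "Skip if only minor replacements made" might be modelled in Python as below. This is a simplified guess at the behaviour described above, not AWB's implementation, and both rules are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    find: str
    replace: str
    minor: bool = False


# Hypothetical rules. The first marks a known false positive and is
# flagged as minor; the second is the real fix.
RULES = [
    Rule(r"\bexercice(?=\s+de\b)", "FALSE", minor=True),
    Rule(r"\bexercice\b", "exercise"),
]


def process(text: str) -> Optional[str]:
    """Return the edited text, or None if the page should be skipped."""
    any_major = False
    for rule in RULES:
        text, count = re.subn(rule.find, rule.replace, text)
        if count and not rule.minor:
            any_major = True
    # Skip when nothing changed or when only minor rules fired.
    return text if any_major else None
```

Because the "FALSE" rule runs first, the real rule no longer sees the false positive, and a page containing nothing else is skipped without being shown.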

Namespaces
I will happily fix typos in most non-talk namespaces.

Common misspellings

 * The settings files are here: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

May 2023: In terms of effort per fix, this approach is no longer efficient.

Each of these lists contains a mixture of spellings. Some are easy, in the sense that most articles that contain that spelling need to be fixed. Others are not easy, because although they are incorrect spellings of English words, they are valid foreign-language words, surnames, brand names, and so on. Back in 2012 there was a backlog of easy errors which I was able to fix. Nowadays I find that other editors keep on top of the easy errors, and, despite my efforts to eliminate the false positives automatically, I'm looking through a list where most of the matches shouldn't be fixed but cannot easily be skipped automatically.

The the

 * The settings file is here

I like to scan for "the the" errors whenever I download a new database dump. My regular expression searches for
 * Either "The" or "the", followed by "the"
 * Perhaps with quotes or apostrophes in between
   * ...said it was the "the greatest thing since sliced bread"
   * ...announced the the sale of the century
 * Perhaps where the second "the" is an article title or in a piped link
   * ...the worst outrage since the the Troubles
   * ...did well in the the previous season
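A search along these lines could be sketched in Python as follows. This is a reconstruction from the description above, not the expression in the settings file, and the exact handling of quotes and link brackets is an assumption.

```python
import re

# Optional run of straight quotes or apostrophes.
QUOTES = r"""["']*"""

# Optional opening of a wikilink, with an optional "Target|" part so
# that the second "the" may be the visible text of a piped link.
LINK_OPEN = r"(?:\[\[(?:[^\[\]|]*\|)?)?"

# "The" or "the", whitespace, optional markup, then a lowercase "the".
THE_THE = re.compile(
    r"\b[Tt]he\s+" + QUOTES + LINK_OPEN + QUOTES + r"the\b"
)


def has_the_the(text: str) -> bool:
    """Return True if the text contains a doubled "the"."""
    return THE_THE.search(text) is not None
```

The pattern is case sensitive and the second word must be lowercase, so "the The Times" is not matched.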

I don't search for "The The" or "the The" where the second "The" is uppercase. I used to, but after a while I couldn't decide whether "The The Who tour..." or "...the The Times reporter..." looked wrong or not.

More generally

 * The settings files are here: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

After each "List of common misspellings", I've been scanning for repeated words beginning with that letter. Here is the main part of the regexp for the letter "U"...



...which searches for a word beginning with "U" or "u", followed by the same word beginning with "u". I found that a search for two uppercase words found too many false positives in book/film/song titles. The words may appear inside wikilinks and may be separated by various kinds of quote mark.
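For illustration, a repeated-word search for the letter "U" might look like this in Python. It is a sketch built from the description above, using a backreference to require the same word twice; it is not the regexp from the settings file, and the set of separators is an assumption.

```python
import re

# Quote marks or closing link brackets that may follow the first word.
CLOSE = r"""["'\]]*"""
# Quote marks or opening link brackets that may precede the second word.
OPEN = r"""["'\[]*"""

# First word starts with "U" or "u"; group 1 captures the rest of it.
# The second word must start with a lowercase "u" followed by exactly
# the same letters (\1), which avoids capitalised pairs in titles.
REPEAT_U = re.compile(
    r"\b[Uu](\w+)" + CLOSE + r"\s+" + OPEN + r"u\1\b"
)


def has_repeated_u_word(text: str) -> bool:
    """Return True if a word beginning with U/u is immediately repeated."""
    return REPEAT_U.search(text) is not None
```

The trailing `\b` stops "use used" from matching, since the second word must end where the first one did.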

The main false positives are species names. I began by telling the database scanner to skip any article containing a Taxobox or Automatic taxobox, and added a rule that guessed that any Latin-like word ending was a false positive. I later decided this was a mistake, and now deal with these more thoroughly.

Many other false positives turn up, so I add additional rules as needed.