User talk:PocklingtonDan/Spelling bot

Proposal
I am aware that such bots have been proposed before and rejected, so before proposing this in an RFA, I'm going to try to get something hammered out here first. I think it would be handy to have an automated (non-assisted) spelling bot.

The Why: Everybody makes spelling mistakes; it's not a problem that is going to go away. Human editors have enough to do correcting typos etc. without having to patrol for "systematic" spelling mistakes too. Spelling mistakes in the main article namespace affect the professional image of Wikipedia.

The Why Not: In the past, people seem to have objected that such a bot would be indiscriminate and cause too many false positives, "correcting" spelling mistakes that were not spelling mistakes at all.

What I propose: All objections can be negated, I believe, by applying sufficient conditions of operation to the bot:
 * The bot would monitor text submitted in recent edits
 * The bot would monitor edits to main article namespace only, not talk or user
 * The bot would not edit words embedded in wikilinks, assuming these are special cases. (I assume that includes everything between double square brackets, double curly brackets, etc., whether "link" or not (think e.g. image syntax); best also to include anything in wikitables (between {| and |}), and anything between html/xml-like tags, including the text *in* the tags, e.g. a word like "paralllelepipido" inside such tags, etc. --Francis Schonken 20:19, 12 January 2007 (UTC))
 * It seems fair to extend this rule to cover any words in any paired tags, be it {}, [], <> or whatever - PocklingtonDan 21:02, 12 January 2007 (UTC)


 * The bot would not edit words with multiple spellings in different languages or language variations.
 * The bot would not edit words that are ambiguous and could be intended to be more than one word.
 * The bot would not edit words in html links, assuming these are special cases.
 * The bot would not edit words with first-letter capitalization (to ignore all proper nouns)
 * The bot would not edit words within quotes/blockquotes (added after a suggestion) —The preceding unsigned comment was added by PocklingtonDan (talk • contribs) 16:26, 12 January 2007 (UTC).
 * The bot would not edit words without a word boundary on both sides
 * The bot would not edit words with "(sic)" in the next 100 chars
 * The bot would not edit words if the same word appeared in that spelling in the article title
 * The list of words matched would be available on the bot page.
 * No new words to be added without second bot RFA


 * (suggestion added by Francis Schonken 15:29, 13 January 2007 (UTC):) If the bot encounters more than (say) 4 or 5 errors on the same page, it doesn't change any spelling, but adds copyedit to the top of the page (rationale: many trivial spelling errors on the same page might indicate something more fundamental is wrong, e.g. the page is written in broken English overall; repairing 5 words without checking the spelling, syntax etc. of the page as a whole might easily lead to silly "improvements" that show incompetence rather than a useful bot)
 * (suggestion added by Francis Schonken 15:44, 13 January 2007 (UTC):) The bot doesn't handle pages which have the spelling template (rationale: this might seem contradictory, but given the limited scope of spelling errors that can be handled by an automatic bot, the bot has no way of knowing whether it corrected all errors - in which case the template should be removed - or only part of them - in which case the template should remain)
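Taken together, the guard conditions above could be prototyped along these lines. This is only an illustrative Python sketch with made-up helper names, not the actual bot; the skip-region patterns are deliberately crude:

```python
import re

# Illustrative sketch of the proposed "the bot would not edit..." guards;
# the pattern set and helper name are assumptions, not the real bot.
SKIP_REGIONS = re.compile(
    r"\[\[.*?\]\]"      # wikilinks and image syntax
    r"|\{\{.*?\}\}"     # templates
    r"|\{\|.*?\|\}"     # wikitables
    r"|<[^>]+>"         # html/xml-like tags
    r"|\".*?\"",        # quoted text
    re.DOTALL)

def is_safe_to_correct(text, match):
    """Apply the proposed guards to one regex match of a misspelling."""
    word = match.group(0)
    if word[:1].isupper():                      # proper-noun rule
        return False
    for region in SKIP_REGIONS.finditer(text):  # skip-region rules
        if region.start() <= match.start() < region.end():
            return False
    if "(sic)" in text[match.end():match.end() + 100]:  # (sic) rule
        return False
    return True
```

The word-boundary rule itself would live in the search pattern (e.g. r"\baccomodation\b"), so it is not re-checked here.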

Proposed list of changes
Rather than trying to catch all typos, the bot would watch for approximately the top 100 most common "systematic" spelling errors, based on list_of_common_misspellings and others. The proposed list is (note: there might be some mistakes in this list currently that need correcting):
 * I can't agree to this list. I posted a few comments, just picking a few I thought wrong on sight, and then checking them in a dictionary... every one of them I checked was indeed wrong - note that my native language isn't even English.
 * Suggestions:
 * No unattended bot corrections on words of less than 10 letters (or maybe 9, but I would certainly not go lower than that for fully automated operation - possible ambiguities decrease with the length of the word);
 * Start with a list of maximum 10 or 20 words, and have these words checked, and counterchecked, and checked again before proceeding.
 * Let bots stay clear of issues that might be perceived as differences between National varieties of English (cooly/coolie etc). --Francis Schonken 19:53, 12 January 2007 (UTC)
 * I absolutely want to stay clear of national-language variations, as stated above. I posted this list here precisely so it could go through a review process and be improved, not because I was stating what was definitely going to be used. If you have suggestions for new words, please add them. I will respond to each individual case below. I will also check every word using my own dictionary - the list was posted here from a webpage of most common errors. - PocklingtonDan 20:04, 12 January 2007 (UTC)
 * What did you think of the other two suggestions? --Francis Schonken 20:21, 12 January 2007 (UTC)
 * My apologies for not explicitly responding to these, I'll do that now:
 * With regard to not correcting words of less than 9 or 10 letters, this seems arbitrary to me, but I am happy to remove any words where there are two or more possible alternate words the user plausibly meant to type.
 * I agree the list should probably be narrowed down a little, possibly by that much, but again not arbitrarily - rather by removing any words that look like they could plausibly have been intended to mean two or more alternate words. I'm flexible though; it wouldn't be a disaster to start with a smaller set of words to check for.
 * Thanks for the continuing feedback! PocklingtonDan 21:01, 12 January 2007 (UTC)
 * Re. "avoid short words (less than 9 or 10 letters)" - not arbitrary, I explained why: "possible ambiguities decrease with length of word". Even seemingly straightforward ones like "adress -> address" all too often have possible alternative minor permutations that lead to a different result (e.g. "adress &rarr; a dress"). Note that all improvement suggestions to the list thus far only affected these shorter words (which is some sort of corroboration to this point too). The possibilities for "alternative sense-making minor permutations" are less frequent for the long complex words, in sum:
 * it is particularly the long complex words that are tedious for a human to correct and easily overlooked (e.g. getting the "l"s right in "paralellogram"), and that can easily be handled by an automated process. The errors in shorter words (e.g. "anallyse") are usually easier for humans to spot, and it might be assumed that such words are thus more easily corrected by them.
 * short word corrections very often need human assistance anyway, so this is no job for a fully automated bot, e.g. "annalyst" &rarr; "annalist" or "analyst"? One immediately spots something is "wrong" with "annalyst", but one would need context to decide with which "correct" word to replace it;
 * Re. "start with a shorter list", I'm primarily speaking of a test run here: better to have a successful (not-complained-about) test run with 20 words than have 100 words in the first run with one complaint, which would prevent (many) people from seeing the advantages, and might make them withdraw support for further experimentation. I include myself in those "(many) people". Sorry about that, I just try to be honest. Current guidance is clear: no fully automated spelling correction bots; one disturbing error would make me think: we were right in the first place, why would we change policy on this issue? Note that the previous point (the word length issue) should also be read as "in a first stage" for the same reasons, because of the more obvious pitfalls as described above. Of course, if the test run is successful, some shorter words could be included if we're really, really sure there's no ambiguity involved, e.g. "tyrrany -> tyranny" might be a good candidate for deployment in a second stage, but I would not include it in the test run list. These shorter words almost always need much more proofing than the long ones. --Francis Schonken 12:18, 13 January 2007 (UTC)
 * Thanks Francis for some really excellent input on this, I think I agree with both your points completely, on reflection. I will formalise everything that has been discussed and start a bot RFA early this coming week. Thanks - PocklingtonDan 17:59, 13 January 2007 (UTC)
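The ambiguity worry is mechanically checkable: before admitting a misspelling to the bot's list, one could count how many dictionary words (or two-word splits, as in "adress" -> "a dress") lie within a single edit of it, and admit the word only if the answer is exactly one. A rough sketch, with a toy word set standing in for a real dictionary:

```python
import string

# Toy dictionary; a real check would use a full word list.
DICTIONARY = {"address", "analyst", "annalist", "dress", "a",
              "parallelogram", "tyranny"}

def edits1(word):
    """All strings one insertion, deletion or substitution away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    inserts = {a + c + b for a, b in splits for c in letters}
    replaces = {a + c + b[1:] for a, b in splits if b for c in letters}
    return deletes | inserts | replaces

def candidate_corrections(misspelling):
    """Dictionary words within one edit, plus two-word splits."""
    found = {w for w in edits1(misspelling) if w in DICTIONARY}
    for i in range(1, len(misspelling)):  # "adress" -> "a dress"
        left, right = misspelling[:i], misspelling[i:]
        if left in DICTIONARY and right in DICTIONARY:
            found.add(left + " " + right)
    return found

def safe_for_bot(misspelling):
    """Only admit words with exactly one plausible correction."""
    return len(candidate_corrections(misspelling)) == 1
```

This directly captures why the short words fail: "adress" has several plausible targets, "annalyst" has two, whereas an unambiguous misspelling has exactly one.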


 * accelarate -> accelerate
 * accessable -> accessible
 * accessery -> accessory
 * accidently -> accidentally
 * accomodate -> accommodate
 * accomodation -> accommodation
 * acustomed -> accustomed
 * annoint -> anoint
 * aquaintance -> acquaintance
 * adress -> address
 * antinatal -> antenatal
 * assasination -> assassination
 * batallion -> battalion
 * cemetary -> cemetery
 * changable -> changeable
 * committment -> commitment
 * concensus -> consensus
 * coolly -> cooly coolly as adv. of "cool" in OED; why "cooly" and not "coolie" (both in Webster's; OED has "coolie")? - REMOVED
 * corollory -> corollary
 * definately -> definitely
 * dessicate -> desiccate
 * dessicated -> desiccated
 * dispair -> despair
 * desparate -> desperate
 * developement -> development
 * disippate -> dissipate
 * dissappointed -> disappointed
 * differense -> difference # was listed the wrong way round. RF
 * drunkeness -> drunkenness
 * ecstacy -> ecstasy both in Webster's; OED has only "ecstasy", so probably a "national varieties of English" issue.
 * elevater -> elevator
 * embarassment -> embarrassment
 * excede -> exceed # Excede is a web design company
 * existance -> existence
 * febuary -> February - UPDATED
 * grammer -> grammar # grammer means grandmother
 * guarantee -> garantee "guarantee" exists both as a verb and as a noun in Webster's - REMOVED
 * harrass -> harass both "harass" and "harras" in Webster's. And BTW also "haras" in Webster's. But "harrass" does not seem to exist.
 * harrassment -> harassment
 * heros -> heroes "heros" might be transliteration of "ἣρως", and is used as such in the context of Greek mythology (see heros)
 * independant -> independent
 * idiosyncracy -> idiosyncrasy
 * inadvertant -> inadvertent
 * indispensible -> indispensable
 * innoculate -> inoculate
 * irresistable -> irresistible
 * irritible -> irritable
 * insistant -> insistent
 * judgment -> judgement
 * liason -> liaison
 * libary -> library
 * liquify -> liquefy
 * momento -> memento "momento" as synonym of "memento" in Webster's; further, why not momento &rarr; momentum? or: momento &rarr; moment? - REMOVED
 * millenium -> millennium
 * mischievious -> mischievous
 * miniscule -> minuscule - UPDATED
 * noticable -> noticeable
 * ocassion -> occasion
 * ocassional -> occasional
 * occurence -> occurrence
 * paralell -> parallel
 * persue -> pursue persue &rarr; peruse?? - REMOVED
 * pityful -> pitiful- Both are accepted
 * posess -> possess # could be a misspelling of posies? - or: posses (plural of "posse")? or: posers?
 * procesed -> processed - UPDATED
 * priviledge -> privilege
 * privelege -> privilege
 * reccomend -> recommend
 * recieve -> receive
 * refered -> referred
 * relavant -> relevant
 * repitition -> repetition
 * sacreligious -> sacrilegious
 * sieze -> seize # could be a typo for size
 * seperate -> separate
 * spatial -> spacial - REMOVED
 * subpena -> subpoena both "subpena" and "subpoena" in Webster's (both synonymously both as verb and as noun) - REMOVED
 * supersede -> supercede both "supersede" and "supercede" in Webster's - REMOVED
 * transfered -> transferred
 * tyrrany -> tyranny
 * unparalelled -> unparalleled
 * wastefull -> wasteful
 * wieght -> weight # could be a typo for wight
 * wierd -> weird # could be a typo for wired
 * yeild -> yield
 * Zeroes -> Zeros
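If adopted, the per-word rules could be compiled into case-sensitive, word-boundary-anchored substitutions, with the more-than-4-errors rule suggested above layered on top. A hypothetical Python sketch using only a small sample of the pairs above:

```python
import re

# Small sample of the proposed (misspelling -> correction) pairs.
CORRECTIONS = {
    "accomodation": "accommodation",
    "definately": "definitely",
    "recieve": "receive",
    "seperate": "separate",
}

# Word boundaries on both sides; no re.IGNORECASE, so words with
# first-letter capitalisation (candidate proper nouns) are left alone.
PATTERNS = [(re.compile(r"\b%s\b" % re.escape(bad)), good)
            for bad, good in CORRECTIONS.items()]

def correct_spelling(text):
    """Apply each pattern; returns (new_text, number_of_replacements)."""
    total = 0
    for pattern, good in PATTERNS:
        text, n = pattern.subn(good, text)
        total += n
    return text, total

def correct_page(text, max_fixes=4):
    """Per the suggestion above: too many errors suggests deeper problems,
    so tag the page for human copyediting instead of editing it."""
    new_text, n = correct_spelling(text)
    if n > max_fixes:
        return "{{copyedit}}\n" + text
    return new_text
```

Because the patterns are case-sensitive, "Definately" at the start of a sentence would be skipped; that is a deliberate (conservative) consequence of the proper-noun rule.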

Responses
Can I get some measured response to this proposal please? I want to try and hammer the proposal into shape such that it stands a chance of getting through the bot RFA process. Please suggest anything you can think of to improve its operation etc. Cheers - PocklingtonDan 15:41, 11 January 2007 (UTC)


 * An automated spelling bot will never be approved by the BAG as per bot policy Betacommand (talk • contribs • Bot) 17:57, 11 January 2007 (UTC)
 * Bot policy is not handed down from the gods, it is something that someone or some group of people once decided, ie it was the consensus when it was written. It can be changed if the consensus now is that it is acceptable under certain strict conditions. Are you stating that it is necessary to go through the process of changing the policy first, and then going through the justification all over again when getting authorisation for the bot? - PocklingtonDan 18:37, 11 January 2007 (UTC)
 * I have started a request to change this policy here, please comment if you have any views on this - PocklingtonDan 18:43, 11 January 2007 (UTC)
 * It has been agreed, policy-wise, that if you can get a bot that can pass BRfA we could give an exception IF AND ONLY IF this bot has a zero chance of error. Comments on the proposal should therefore be posted below.  John Broughton  |  Talk 16:21, 12 January 2007 (UTC)


 * What I am saying is that this is not a job for a bot, ask to have the words added to AWB as a fully automated bot has too many issues and risks. Also the community has said that they don't want spellchecking bots. This is a task for humans to do as the room for error is very small Betacommand (talk • contribs • Bot) 18:49, 11 January 2007 (UTC)
 * What are the issues and risks involved in such a bot? At worst, it would match a false positive and change the spelling of a word that was deliberately misspelt. Based on the outline above, I believe the chances of that happening are incredibly small. It would only be operating on 50-100 of the most commonly misspelt words and I have outlined various safeguards above to prevent false positive matches. Worst case scenario, it corrects a word it shouldn't have. I don't see why that is so dire, given that AVB's cockups, when they occur, are far worse, actually restoring vandalism in some cases. I agree with the basic premise that a bot cannot catch all typos, so the bot doesn't try to do this; it tries to catch just the 50-100 most common systematic spelling errors.


 * What I am really after is comments on improving the rules/restrictions to reduce the possibility of false positive matches. Thanks - PocklingtonDan 19:51, 11 January 2007 (UTC)


 * I think the issue is that this is too much changing to turn over to a bot. You say only change the most obvious ones, but these most obvious ones are also the least burden on the user. The more useful the change, the more likely the error. You could have a script which found the misspellings, and then asked a human to approve them article by article, so that a human takes responsibility for the change. The human might approve the change without checking it carefully, but the person takes responsibility. When I am doing disambiguation, I check the subject, and judge its seriousness. If I disambiguated an article about a rock band, and now it says "In September, they fish for bass", I figure that's just the price of playing poker. But if an article is about an emperor, I read the changes more carefully. Also when an article is poorly written, I treat it as less serious.


 * Which makes me ponder the prospect of seriousness. Perhaps we should have an official rating of seriousness.  If no one has marked it serious, the bots can edit anything.  The more people who mark it serious, the more careful bots must be.  If an article is controversial, it will promptly be marked off limits to robots.  A scheme like that could work, and would seriously ease the quality burden.  -- Randall Bart 09:25, 12 January 2007 (UTC)


 * Fixing spelling errors is potentially the sort of thing that should be automated so humans don't have to do it. If the robot makes too many mistakes, it should be fixed or turned off. But arguing that it shouldn't even be tried because it might be a problem is, well, ill-reasoned, I think. Even if it's true that these most obvious ones are also the least burden on the user, they still represent some burden on the user. And spelling errors, while they exist, distract from doing more important changes.


 * My primary concern would be with quotations - obviously we don't want a bot touching those. So, for example, if a section contains a "blockquote" tag, I suggest that the bot skip the section entirely.  Similarly, if there is a quotation mark (say) within the 300 characters preceding something that looks like a spelling mistake, then the bot should avoid this.


 * What we absolutely don't want is false positives. It's okay to err on the side of false negatives - where the bot ignores a misspelled word for one reason or another.  But I think it's possible to put in enough safeguards to eliminate false positives.  John Broughton  |  Talk 16:17, 12 January 2007 (UTC)


 * It's nice to have a voice of reason :-) I absolutely wish to prevent all false positive matches, and that's why I have laid out, want discussion of, and am willing to amend, the list of restrictions on when the bot makes spelling copyedits. I too believe that it is possible with community input to reach a consensus on a "safe" set of restrictions on the bot that limits when it edits spelling mistakes. Also, these "systematic" spelling mistakes are the least suited to correction by human editors, since the chances are good that most human editors don't know the correct spelling off-hand. I have added the quote proposal above; I think it is a good one. Can people just confirm there are no errors in my proposed list of spelling corrections too please? Cheers - PocklingtonDan 16:30, 12 January 2007 (UTC)

(undent) A second concern is that a spelling bot could obscure vandalism. By that I mean that an anonymous vandal hits a page and makes a spelling error; 60 seconds later, the spelling error is fixed by SpellingBot; and on everyone's watchlist, the most recent edit is now by SpellingBot, not an anonymous vandal. I realize that watchlists can be tailored to ignore bots, and to display more than the last edit, but those are not the defaults. So you might consider not correcting spelling errors until (say) 24 hours after they occur, or at least some delay to give VandalBot and human editors a chance to revert an edit entirely before you try to fix spelling mistakes.
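The suggested delay could be implemented as a simple age check on the edit that introduced the misspelling (per the caveat in the reply below, the check is against that edit, not the page's most recent edit). A minimal sketch, assuming the bot already knows that edit's UTC timestamp:

```python
from datetime import datetime, timedelta, timezone

# Sketch of the proposed delay; 24 hours is the figure suggested above.
MIN_AGE = timedelta(hours=24)

def old_enough(error_edit_utc, now=None):
    """True if the edit that introduced the misspelling is old enough
    that vandalism patrol has had a chance to revert it first."""
    now = now or datetime.now(timezone.utc)
    return now - error_edit_utc >= MIN_AGE
```

In practice the timestamp would come from the page's revision history; the function itself stays a pure, easily tested check.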

Also, regarding the spelling list, what I think is needed is a safeguard whenever words are added. For example, you might only add words once every (say) two or three months (on a regular schedule), and have a talk page specifically to (a) announce the proposed new words and (b) provide a waiting period of at least 7 days (14 would be better) for responses before the words are added. John Broughton |  Talk 16:41, 12 January 2007 (UTC)


 * A delay to allow for removal of edit by anti-vandal bots etc seems reasonable. However, there is always going to be some slight problem of obscuring vandalism because *whenever* the bot writes to the page, it is going to obscure the previous edit. This may be an unconnected, later vandal edit. If the bot had to wait an hour or 24 hours after the most recent edit to the page (rather than just the mis-spelling edit) then on even medium-traffic pages, it would never be permitted to write its edit!


 * I think adding words to the list to watch for would occur infrequently at best, since the top 100 misspelt words is fairly static. But I would be sure to post up proposals and invite comment on them before implementation, absolutely.


 * PocklingtonDan 18:09, 12 January 2007 (UTC)


 * IMHO, spell check is already available client-side (such as SpellBound) for most editors, and could be implemented server-side (perhaps through a partnership with a spell checking vendor) in a manner that would PREVENT the spelling errors, rather than going in and repairing them as a new edit. A server-side implementation would also allow for the possibility of ignoring the suggestion or disabling the feature per editor. Adding or increasing the spell check capability of other client side solutions is generally welcome when done judiciously (such as an AWB plugin).  —  xaosflux  Talk  05:16, 13 January 2007 (UTC)


 * A large percentage of edits are done by anonymous editors (admittedly, most minor, but the absolute number is still large) who aren't likely to be interested in installing anything client-side. For that matter, neither are the vast majority of new editors who have fairly low edit counts. I'm all in favor of server-side solutions, but given other priorities, it could be years before this is implemented; why not implement something that helps in the meantime?


 * On a different subject, the original proposal said No new words to be added without second bot RFA. If that's done, please don't do this frequently - once per month or less frequently, please. Also, I don't see why you would want to pick an arbitrary number like "100" as the number of words to be processed by the bot. If you can find 550 well-defined problem words, why not do them all?


 * Finally, I think we can get hung up here on the list, when what should really be at issue is the concept. If the concept is approved and a trial run with (say) 10 words works well, then would be the time, I think, to come back with a larger list and get comments on that, and that only - much clearer and cleaner.   John Broughton  |  Talk 15:46, 13 January 2007 (UTC)


 * Thanks for the input, John. I'm in absolute agreement that something built into the wiki software itself would be ideal for this, but then the same could be said for "vandal" edits too - and as with them, until someone has the time and wherewithal to code them into the wiki software directly, anything that gets part-way there through another method seems helpful to me. I think I now have enough feedback and comments to sit back and compile a proper proposal and move to the bot RFA process early next week. My thanks to everyone for a reasoned discussion of the various benefits and drawbacks. - PocklingtonDan 18:03, 13 January 2007 (UTC)

For what it's worth, a belated response:

The length of that list above, of all the things that "The bot would not edit...", should give one pause. Any time a list is that long and contains that many ad-hoc conditions, it's a virtual certainty that the list is incomplete, that there are still more ad-hoc exceptions waiting to be identified.

People have complained about the problem of false positives, but an issue at least as big is the question of false negatives. By the time you program the bot sufficiently conservatively -- with all those "The bot would not" conditions, and more -- so that it would make relatively few wrong edits, there are huge numbers of right edits that it wouldn't make, either, that some person would have to come along later and fix manually or semiautomatically anyway.

A second question is the difficulty of actually implementing all those "The bot would not" conditions. Arguably, by the time you'd implemented and thoroughly tested them all, you could have done an awful lot of semiautomated spellchecking in the same time.

I fully agree that misspelled words are a problem, and that they make the encyclopedia look unprofessional. However, I have to also agree with the reasoning behind the existing bot policy, which declines to consider automated spellchecking as a viable bot possibility. To properly automate (i.e. semiautomate) the task of spellchecking Wikipedia, I would recommend writing a script that:
 * 1) Greps a Wikipedia dump for a candidate misspelled word or words (perhaps drawn from the list above).
 * 2) Fetches the pages and checks whether they still contain the misspelling.
 * 3) Checks the misspellings against some of the "The bot would not" conditions, and also a stop list (see below).
 * 4) Presents the remaining pages to a human editor, perhaps via AWB.
 * 5) Keeps a "stop list" of words-in-context which are accepted misspellings, so as to avoid asking about them again.
 * 6) Somehow allows the user (in step 4) to augment the stop list, i.e. by giving a three-way "Correct, accept, defer" prompt for each candidate misspelled word. ("Correct" means correct the misspelling. "Accept" means accept the misspelling, by adding it to the stop list. "Defer" does nothing, meaning that the word might be prompted for again later. The "Accept" versus "Defer" distinction is important, because correctly accepting a deliberately-misspelled word is a significant step, which may require research, which the human editor might not feel like performing just then.)

Hmm, this would be pretty easy. Perhaps I'll start writing it myself. :-) —Steve Summit (talk) 00:54, 4 February 2007 (UTC)
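The six steps above might be stubbed out roughly as follows; the dump-grepping and page-fetching (steps 1-2) are reduced to an in-memory dict, and all helper names are hypothetical:

```python
# Sketch of the semiautomated workflow described above.  The dump scan and
# page fetch are stubbed; in practice they would read a database dump and
# the live wiki respectively.

MISSPELLINGS = {"accomodation": "accommodation", "recieve": "receive"}

def review(pages, stop_list, prompt):
    """Steps 3-6: filter by the stop list, then ask a human per candidate.

    `prompt(title, word)` must return "correct", "accept" or "defer".
    Returns the list of (title, word) pairs approved for correction.
    """
    approved = []
    for title, text in pages.items():
        for bad, good in MISSPELLINGS.items():
            if bad not in text:             # step 2: still present?
                continue
            if (title, bad) in stop_list:   # steps 3/5: accepted misspelling
                continue
            answer = prompt(title, bad)     # step 4: human decides
            if answer == "correct":
                approved.append((title, bad))
            elif answer == "accept":        # step 6: remember the decision
                stop_list.add((title, bad))
            # "defer": do nothing; may be asked again later
    return approved
```

The key design point matches the discussion: the script only *proposes*; every actual edit is approved by a human, and "accept" decisions persist so the same deliberate misspelling is never asked about twice.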