User:The Anome/Placeopedia study

Placeopedia data
I've just done a test run of adding coordinates from Placeopedia to articles using The Anomebot2.

The results so far:

The database contained 18368 entries, of which, after removing article names that the bot had already logged as geocoded from previous runs, the bot judged around 6000 to be worth inspecting.

The file was UTF-8 encoded. 17628 of the entries had all-ASCII names. 723 had correctly UTF-8 encoded non-ASCII characters in their names. 17 entries had their names stored as undecodable byte sequences, which make sense neither as UTF-8 nor as mojibake.
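The three-way split above can be reproduced with a short script. This is a minimal sketch, not the code The Anomebot2 actually uses: the function names and the sample names are made up for illustration, and it assumes each entry name is available as raw bytes. It does not try to detect mojibake, only whether a name is pure ASCII, valid non-ASCII UTF-8, or not decodable as UTF-8 at all.

```python
from collections import Counter


def classify_name(raw: bytes) -> str:
    """Classify one raw entry name (hypothetical helper, not the bot's code)."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 at all
        return "undecodable"
    # Valid UTF-8: separate pure-ASCII names from names with non-ASCII characters
    return "ascii" if text.isascii() else "utf8"


def tally(names):
    """Count how many names fall into each class."""
    return Counter(classify_name(name) for name in names)


# Illustrative sample only; these are not entries from the Placeopedia file
sample = [b"London", "Zürich".encode("utf-8"), b"\xff\xfe\x80"]
counts = tally(sample)
print(counts)
```

Run over the whole file, a tally like this should give per-class counts that add up to the total number of entries, as the figures above do (17628 + 723 + 17 = 18368).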

During a truncated test run, the bot visited 4022 of these 6000-or-so articles (an estimated 200 or so more were visited but did not exist, and their names weren't logged at the time), of which:
 * 1164 articles were judged to be eligible for editing by the bot (most of the rejections were because the article had already been tagged by someone else), of which:
   * 402 had previously been marked with coord missing, and all seemed to be suitable for geocoding
   * 714 had not previously been marked with coord missing, of which:
     * 653 seemed to be suitable for geocoding
     * 61 were, after manual inspection of the list of edits and manual inspection of articles with potentially unlocatable-sounding names, found to be about topics which should not have been geocoded, and had to be de-tagged by hand

Out of those 61, here are some examples of wrongly-tagged articles.
 * A few are silly: House, and Del.
 * Some less silly examples: Hermia, Grain elevator, Cyclotron, Cenacle, Corniche, Fountain square, Hypocenter. Many of these are cases of insufficient disambiguation or insufficient specificity: for example, they may refer to the science park in Finland instead of the mythological figure, or a particular fountain square, grain elevator or hypocenter.
 * Some examples that are matters of policy or judgement: Internal Revenue Service, CBC Radio 3, Soca music, American Chemical Society.

61 out of 1164 is an error rate of about 5%, which is uncomfortably high: I generally consider that bot-generated edits need to be at least 99% accurate in order to make sure that bot operation raises, rather than lowers, overall average data quality, and reduces, rather than increases, the requirement for human labour.
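The arithmetic behind that comparison, as a small sketch. The function names are made up for illustration; the 99% figure is the threshold stated above, and the counts are the ones from this test run.

```python
def error_rate(errors: int, total: int) -> float:
    """Fraction of bot edits that had to be reverted by hand."""
    return errors / total


def max_allowed_errors(total: int, accuracy_percent: int = 99) -> int:
    """Largest number of bad edits compatible with the accuracy target.

    Integer arithmetic is used so that the result is not affected by
    floating-point rounding.
    """
    return total * (100 - accuracy_percent) // 100


eligible = 1164        # articles judged eligible for editing
wrongly_tagged = 61    # articles that had to be de-tagged by hand

rate = error_rate(wrongly_tagged, eligible)
budget = max_allowed_errors(eligible)

print(f"observed error rate: {rate:.1%}")
print(f"errors allowed at 99% accuracy: {budget}")
```

On these figures, a 99% target would have allowed at most 11 bad edits out of 1164, against the 61 actually found.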

I haven't yet done any systematic review of coordinate quality, so I don't know how accurate the actual locations are. In a very quick random check of 11 of the marked articles that were about validly geocodable topics, all seemed to have reasonable coordinates; however, this is not a large enough sample to gain any real idea of the error rate.
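To put a rough number on "not large enough": the sketch below is an illustration rather than part of the study. It assumes the 11 articles were an independent random sample and picks a 95% confidence level, which is an assumption made here and not a figure from the test run. If n checks all pass, the largest true error rate p still consistent with that outcome satisfies (1 - p)^n = 0.05.

```python
import math


def upper_bound_zero_failures(n: int, confidence: float = 0.95) -> float:
    """Upper bound on the true error rate after n checks with no errors found.

    Solves (1 - p) ** n = 1 - confidence for p.
    """
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n)


def samples_needed(max_error_rate: float, confidence: float = 0.95) -> int:
    """Clean checks needed to bound the error rate below max_error_rate."""
    alpha = 1.0 - confidence
    return math.ceil(math.log(alpha) / math.log(1.0 - max_error_rate))


bound = upper_bound_zero_failures(11)
needed = samples_needed(0.01)

print(f"upper bound after 11 clean checks: {bound:.1%}")
print(f"clean checks needed to support a 1% bound: {needed}")
```

Under these assumptions, 11 clean checks only rule out error rates above roughly 24%, and supporting a 1% bound would take about 300 clean checks.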

-- The Anome (talk) 01:35, 25 December 2008 (UTC)