User:The Anome/Geocluster analyzer

A new project: the Geocluster analyzer.

We now have almost 1 million geocoded pages, and I believe we may be approaching critical mass for a new way to geocode pages.

The old matching algorithm
The existing geocoding bot software works by cross-correlating Wikipedia articles with GNS features, matching (stemmed name, article type, country) triples. On the Wikipedia side:
 * Stemmed names are derived by stripping suffixes from Wikipedia article names.
 * Article types are derived by traversing the Wikipedia category tree, following only links related to the same feature type, based on keyword analysis of category names.
 * Countries are derived by traversing the Wikipedia category tree from those articles, again following only feature-related links, based on keyword analysis of category names.
On the GNS side:
 * Stemmed names are derived from the GNS name of the feature
 * Feature types are derived from the GNS feature code
 * Countries are derived from the GNS CC1 country code
A wide variety of heuristics are added to avoid false positives in each case.
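
To make the triple matching concrete, here is a minimal sketch in Python. The suffix list and the helper names (strip_suffixes, wiki_triple, gns_triple) are illustrative assumptions for this page, not the bot's actual code.

 import re  # not strictly needed here; simple string operations suffice
 
 # Illustrative suffixes only; the real bot uses a much longer, hand-tuned list.
 SUFFIXES = [" (river)", " (mountain)", " (lake)", ", Texas"]
 
 def strip_suffixes(article_name):
     """Return a stemmed name by stripping known disambiguating suffixes."""
     for suffix in SUFFIXES:
         if article_name.endswith(suffix):
             return article_name[:-len(suffix)].strip()
     return article_name
 
 def wiki_triple(article_name, article_type, country):
     """Triple from the Wikipedia side: (stemmed name, article type, country),
     where article type and country come from category-tree analysis."""
     return (strip_suffixes(article_name).lower(), article_type, country)
 
 def gns_triple(gns_name, feature_code, cc1):
     """Triple from the GNS side: (name, feature code, CC1 country code)."""
     return (gns_name.lower(), feature_code, cc1)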

Manually created mapping tables have been used to detect keywords in categories and map these onto GNS feature codes. In some cases, words in article titles can be used to help make feature matches.
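
As a hedged illustration of what such a mapping table might look like (the keyword choices below are examples, not the bot's actual tables; PPL, STM, MT, LK and ISL are standard GNS feature designation codes):

 # Illustrative mapping from category keywords to GNS feature codes.
 CATEGORY_KEYWORD_TO_FEATURE = {
     "villages": "PPL",    # populated place
     "towns": "PPL",
     "cities": "PPL",
     "rivers": "STM",      # stream
     "mountains": "MT",    # mountain
     "lakes": "LK",        # lake
     "islands": "ISL",     # island
 }
 
 def feature_code_from_categories(categories):
     """Return the first GNS feature code suggested by an article's categories,
     or None if no keyword matches."""
     for category in categories:
         for keyword, code in CATEGORY_KEYWORD_TO_FEATURE.items():
             if keyword in category.lower():
                 return code
     return None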

Articles are then only considered to match GNS features when:
 * A triple derived from Wikipedia matches a triple derived from the GNS data,
 * There is only one such triple derived from Wikipedia (ie only one article of that name, of that particular type, in that country), and
 * There is only one such triple derived from the GNS (ie only one feature of that name, of that particular type, in that country).
Once all three conditions are true, the feature is considered a match, and provisionally eligible for geocoding.
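
A minimal sketch of the uniqueness test, assuming each side has already been reduced to a list of (triple, record) pairs; the function and variable names are illustrative:

 from collections import defaultdict
 
 def unique_by_triple(pairs):
     """Group (triple, record) pairs by triple and keep only triples that
     occur exactly once on this side."""
     groups = defaultdict(list)
     for triple, record in pairs:
         groups[triple].append(record)
     return {t: records[0] for t, records in groups.items() if len(records) == 1}
 
 def match_candidates(wiki_pairs, gns_pairs):
     """Return (article, feature) pairs satisfying all three conditions:
     the triples match, and each triple is unique on both sides."""
     wiki_unique = unique_by_triple(wiki_pairs)
     gns_unique = unique_by_triple(gns_pairs)
     return [(wiki_unique[t], gns_unique[t]) for t in wiki_unique if t in gns_unique]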

A few countries, such as Poland, Japan, and Canada, are excluded from this processing: they have a high rate of re-use of similar feature names within the same country, combined with poor coverage in both Wikipedia and the GNS, which can leave only one instance of a re-used name visible on each side and so trigger a false match.

The article on Wikipedia is then examined at run-time, and a variety of heuristics are applied to its categories and content to detect:
 * articles already containing coordinates
 * redirects
 * disambiguation pages
 * lists
 * articles about types of things, rather than instances of those things
 * articles about people
 * articles about fictional topics
 * articles about extraterrestrial objects
 * articles about animals, plants and other living things
 * articles about non-compact features such as roads or railway lines
 * a wide variety of other niche cases
Only when all of these checks have been satisfied (that is, the article falls into none of these cases) is a "coord" template added to the article, bearing the GNS coordinates. Based on a mixture of manual checking and error reports from other editors, this procedure is at least 99% accurate, and possibly as accurate as one error per thousand articles. This compares well with manually added geodata.
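
A rough sketch of what this run-time screening could look like; the helper names, category keywords, and template test below are assumptions for illustration, and the real heuristics are far more extensive:

 # Illustrative run-time screen; keyword list is an example, not the real one.
 SKIP_CATEGORY_KEYWORDS = [
     "disambiguation", "births", "deaths", "living people",
     "fictional", "astronomical objects", "animals", "plants",
     "roads", "railway lines",
 ]
 
 def eligible_for_coord(article_text, categories):
     """Return True only if none of the exclusion heuristics fire."""
     text = article_text.lower()
     if "{{coord" in text:                 # already has coordinates
         return False
     if text.startswith("#redirect"):      # redirect page
         return False
     for category in categories:
         c = category.lower()
         if c.startswith("lists of") or any(k in c for k in SKIP_CATEGORY_KEYWORDS):
             return False
     return True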

The new proposed matching algorithm
In theory, the old matching method could be improved by using finer-grained information about sub-national regions and sub-regions to further disambiguate locations. However, the GNS data is very poor at coding sub-national regions, there is no clear mapping between Wikipedia regional categories and GNS sub-national regions, and in any case the GNS does not distinguish regions any finer than first-order sub-national regions.

So I'm going to try a new method.

[details to follow]