User:The Anomebot2

Note: Blocking will stop further edits. The bot will intermittently retry errors for several minutes, but should then shut itself down automatically until restarted manually; please use a block of ten minutes or longer to be sure of stopping it.

This bot is designed to add standardized machine-readable geodata records to relevant articles in the English-language Wikipedia, using data from GNS, GNIS, OSGB coordinates in UK articles, plaintext geodata scraped from article text, and interwiki-linked geotag data from other-language Wikipedias. -- The Anome 12:13, 22 September 2007 (UTC)

Status

 * 125,000+ geotags added to date, based on data from a variety of sources
 * 102,000+ more articles identified as potentially eligible for tagging have been marked using the "coord missing" template
 * 61,000+ articles reformatted from old geotag formats to use the "coord" template

- The Anome (talk) 14:10, 13 September 2008 (UTC)

Update: As of 2009-06-12:
 * 365,800 distinct articles edited -- The Anome (talk) 17:08, 12 June 2009 (UTC)

Recent activities
Currently backfilling a number of corner cases missed by earlier over-cautious heuristics, using:
 * machine parsing of plaintext geodata found in dumps
 * automatched GNS data
 * interwiki-matched machine-readable geodata from other language editions

This is very laborious for the bot, as it requires re-scanning large numbers of false positives, and it will result in only a few hundred articles being geocoded. However, machine time is cheap, the re-scans are necessary in any case, and this will lay the foundations for larger systematic efforts later.

-- The Anome (talk) 14:30, 14 September 2008 (UTC)


 * Done. -- The Anome (talk) 04:29, 15 September 2008 (UTC)
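
The following is a minimal sketch of what machine parsing of plaintext geodata could look like; it is an illustration only, not the bot's actual code. It assumes coordinates written in degrees/minutes/seconds form with a hemisphere letter, and the names DMS, to_decimal and find_coordinates are hypothetical. It ignores decimal degrees, OSGB grid references and other formats entirely, which is one reason a real scan would throw up many false positives.

```python
import re

# Hypothetical pattern: degrees, optional minutes, optional seconds, hemisphere.
# The lookbehind avoids matching the tail of a decimal number such as "51.5".
DMS = re.compile(
    r"(?<![\d.])(?P<deg>\d{1,3})\s*°\s*"
    r"(?:(?P<min>\d{1,2})\s*[′']\s*)?"
    r'(?:(?P<sec>\d{1,2}(?:\.\d+)?)\s*[″"]\s*)?'
    r"(?P<hemi>[NSEW])"
)


def to_decimal(match):
    """Convert one matched DMS value to signed decimal degrees."""
    value = (
        int(match["deg"])
        + int(match["min"] or 0) / 60
        + float(match["sec"] or 0) / 3600
    )
    return -value if match["hemi"] in "SW" else value


def find_coordinates(text):
    """Return (lat, lon) pairs for each N/S value closely followed by an E/W value."""
    matches = list(DMS.finditer(text))
    results = []
    i = 0
    while i + 1 < len(matches):
        first, second = matches[i], matches[i + 1]
        is_pair = (
            first["hemi"] in "NS"
            and second["hemi"] in "EW"
            and second.start() - first.end() <= 5  # allow a space or comma between
        )
        if is_pair:
            lat, lon = to_decimal(first), to_decimal(second)
            if abs(lat) <= 90 and abs(lon) <= 180:  # discard impossible values
                results.append((lat, lon))
            i += 2
        else:
            i += 1
    return results


if __name__ == "__main__":
    print(find_coordinates("It lies at 51°30′26″N 0°7′39″W, near the river."))
```

A latitude with no longitude close behind it is dropped rather than guessed at, in keeping with the cautious approach described above.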

Current activity
Finishing adding a large number of coord missing tags. Almost complete. -- The Anome (talk) 23:50, 13 October 2008 (UTC)

To do
Geotags:
 * Standardize existing geotags. "coor title *" is now done, "coor *" pending.
 * Finish adding "coord missing" to all eligible articles.
 * Ancient sites should be taggable with "coord missing", while still being blocked from receiving coordinates automatically.
 * Go back and use CatScan to find any remaining franchises mis-tagged as "coord missing".
 * Rebuild article state map from log file and other stored data.

Interwiki:
 * In the absence of up-to-date Kolossus data, start using the externallink table API to live-scan non-en: Wikipedia editions for URLs, in order to obtain interwiki patterns.
 * Use full interwiki data to regenerate fuller tags where only KML data was used for earlier tagging.

Consistency and correctness:
 * Use 1-degree-tile binning to look for outliers.
 * Look for misuse of coord tags for off-planet locations, and report these to the WikiProject for fixing.
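
As a rough illustration of the 1-degree-tile binning idea, the sketch below groups geotagged articles (for example by category), bins each group into 1-degree tiles, and flags any article whose tile and eight neighbouring tiles contain no other member of its group. This is only one possible reading of the to-do item, not the bot's code; the record layout and the names tile_of, neighbourhood and isolated_articles are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tile_of(lat, lon):
    """Return the 1-degree tile containing a point, with longitude wrapped to [-180, 180)."""
    return (math.floor(lat), (math.floor(lon) + 180) % 360 - 180)


def neighbourhood(tile):
    """Yield the tile itself and its 8 neighbours, wrapping longitude at the antimeridian."""
    t_lat, t_lon = tile
    for d_lat in (-1, 0, 1):
        for d_lon in (-1, 0, 1):
            yield (t_lat + d_lat, (t_lon + d_lon + 180) % 360 - 180)


def isolated_articles(records):
    """Flag titles with no same-group neighbour in the surrounding 3x3 block of tiles.

    records: iterable of (title, group, lat, lon) tuples (a hypothetical layout).
    Groups with a single member are skipped, since there is nothing to compare against.
    """
    rows = list(records)
    counts = defaultdict(Counter)
    for _title, group, lat, lon in rows:
        counts[group][tile_of(lat, lon)] += 1

    flagged = []
    for title, group, lat, lon in rows:
        group_tiles = counts[group]
        nearby = sum(group_tiles[t] for t in neighbourhood(tile_of(lat, lon)))
        if nearby == 1 and sum(group_tiles.values()) > 1:
            flagged.append(title)
    return flagged


if __name__ == "__main__":
    sample = [
        ("Village A", "Villages in England", 51.5, -0.1),
        ("Village B", "Villages in England", 51.6, -0.2),
        ("Village C", "Villages in England", 52.1, 0.3),
        ("Village D", "Villages in England", -51.5, -0.1),  # latitude sign flipped
    ]
    print(isolated_articles(sample))
```

An isolated tile is only a hint that something may be wrong, such as a flipped sign or swapped latitude and longitude; genuinely remote places will be flagged too, so the output is a list for review rather than a list of errors.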

Matching:
 * Hierarchical matching with disambiguation by subnational entities. This was rejected some time ago as ineffective, but may have become feasible with the greater systematization of navboxes over the last year.
 * Open research topic: Bayesian inference of relative locality from the link graph -- this may be an effective way of handling the above. Use places with known locations as training set.
 * Properly handle undersea features and disputed territories with no applicable recognized country
 * Types of places not yet keyword-matched during graph traversal:
    * Casinos
    * Resorts
    * Historic districts [?]
    * Ports and harbo[u]rs by country
    * Bus and some metro stations

New data sources:
 * Collect lists of country-specific coordinate data
 * Mine geodata from images included in articles (thanks to User:Planemad for the suggestion)

Infoboxes:
 * Scan for unusual/broken parameters in infoboxes.
 * Start work on standardizing infoboxes.

-- The Anome (talk) 13:47, 12 October 2008 (UTC)

Forthcoming attractions
With >70,000 data points, I now have enough data to do a spatial analysis of the category tree, and to generate lists of possibly misclassified or mislocated outliers. The cleaned-up bounding data could then be used as a Bayesian classifier for future work. -- The Anome 10:14, 24 August 2007 (UTC)


 * The category+link graph may be a better choice for this. -- The Anome (talk) 13:59, 12 October 2008 (UTC)

Ambiguity problems
Because of severe name ambiguity problems:
 * Japanese locations are now filtered out of most machine-matched geodata sets.
 * Recent Canadian data has had similar problems, and is now also filtered from the output of several matching algorithms.
 * Polish locations are also filtered, because numerous bizarrely formatted disambiguation pages confused the matching algorithms.
-- The Anome (talk) 13:58, 12 October 2008 (UTC)