User talk:The Anome/Gminas for geocoding

Gminas problem
Hi, maybe we should continue this discussion here, to avoid overloading the WP:Poland page. Can you let me know what format you have the GNS data in? Do the place names include the Polish diacritics? Are they tagged with at least the province name? It seems we will have to find an effective way of mapping between your data (i.e. names + coords + not much(?) location info) and the data we have on WP now (i.e. names + full location info + only some coords).--Kotniski (talk) 18:13, 18 March 2009 (UTC)

The data files are at http://earth-info.nga.mil/gns/html/namefiles.htm; a description of the format is at http://earth-info.nga.mil/gns/html/gis_countryfiles.htm. The files are UTF-8 encoded, so there's no problem with accents.

Here's an example record, with empty fields omitted:

2 -494241 -705199 50.316667 17.366667 501900 172200 33UXR6848776519 NM33-06 P PPL PL 48 D BODZANOW Bodzanów Bodzanow 1993-12-28

where the key fields are:

 * 50.316667 17.366667 is the WGS84 location lat/long in signed decimal degrees
 * 501900 172200 is the same thing in signed degrees/minutes/seconds
 * P is "populated place type feature"
 * PPL is a generic "populated place"
 * PL = FIPS 10-4 country code for Poland
 * 48 = in FIPS subregion 48 within Poland
 * D = Not verified or daggered name
 * BODZANOW = canonical sort string
 * Bodzanów = full name including diacritics
 * Bodzanow = full name without diacritics

Most of the other fields are unique identifiers of various sorts.
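As a rough illustration of reading such a record, here is a minimal Python sketch. It assumes the file is tab-separated with a header row, and the column names used (LAT, LONG, DMS_LAT, DMS_LONG, FC, DSG, CC1, ADM1, NT, SORT_NAME, FULL_NAME, FULL_NAME_ND) are assumptions that should be checked against the format description linked above. Only the fields discussed here are included, and the helper names are made up for the example. It also checks that the packed degrees/minutes/seconds values agree with the decimal ones.

```python
import csv
import io

# Assumed layout: tab-separated, header row first. Column names are
# assumptions; only the fields discussed in this thread are included.
SAMPLE = (
    "LAT\tLONG\tDMS_LAT\tDMS_LONG\tFC\tDSG\tCC1\tADM1\tNT\t"
    "SORT_NAME\tFULL_NAME\tFULL_NAME_ND\n"
    "50.316667\t17.366667\t501900\t172200\tP\tPPL\tPL\t48\tD\t"
    "BODZANOW\tBodzanów\tBodzanow\n"
)


def dms_to_decimal(dms):
    """Convert a signed packed DDMMSS or DDDMMSS string to decimal degrees."""
    sign = -1.0 if dms.startswith("-") else 1.0
    digits = dms.lstrip("+-").zfill(6)  # pad so the slices below always work
    degrees = int(digits[:-4])
    minutes = int(digits[-4:-2])
    seconds = int(digits[-2:])
    return sign * (degrees + minutes / 60 + seconds / 3600)


def read_records(text):
    """Return the rows of a tab-separated GNS-style excerpt as dicts."""
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


records = read_records(SAMPLE)
rec = records[0]
print(rec["FULL_NAME"], float(rec["LAT"]), dms_to_decimal(rec["DMS_LAT"]))
```

For a real file, the same reader could be given an open file handle (opened with encoding="utf-8") instead of the in-memory sample.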

The data is far from perfect and not fully comprehensive, and it frequently has one-to-many and many-to-one problems, so I generally sanity-check it against other data, such as heuristics based on article naming, content, template fields and category tree data. The combination of all these checks generally produces matches that are 99%+ reliable. Unfortunately, the high level of name reuse in Poland was defeating my checks and letting far too many bad matches through.
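To make the one-to-many and many-to-one problem concrete, here is a small Python sketch of the most conservative check possible: accept a pairing only when the name occurs exactly once on both sides. This is not the actual matching engine, just an illustration; the function name, the input shapes and the placeholder data are all invented for the example.

```python
from collections import defaultdict


def unique_name_matches(gns_rows, article_rows):
    """Pair entries whose name occurs exactly once on each side.

    gns_rows:     iterable of (name, (lat, lon))   -- hypothetical shape
    article_rows: iterable of (name, article_title) -- hypothetical shape
    Returns a dict mapping article_title -> (lat, lon).
    """
    gns_by_name = defaultdict(list)
    for name, coords in gns_rows:
        gns_by_name[name].append(coords)

    articles_by_name = defaultdict(list)
    for name, title in article_rows:
        articles_by_name[name].append(title)

    matches = {}
    for name, coords_list in gns_by_name.items():
        titles = articles_by_name.get(name, [])
        # Reject anything ambiguous on either side (name reuse).
        if len(coords_list) == 1 and len(titles) == 1:
            matches[titles[0]] = coords_list[0]
    return matches


# Placeholder data, not real places or coordinates.
gns = [
    ("Alfa", (50.0, 17.0)),
    ("Beta", (51.0, 18.0)),
    ("Beta", (52.0, 19.0)),   # same name twice in GNS: one-to-many
    ("Gamma", (53.0, 20.0)),
]
articles = [
    ("Alfa", "Alfa, Example County"),
    ("Beta", "Beta, Example County"),
    ("Gamma", "Gamma, First County"),
    ("Gamma", "Gamma, Second County"),  # two articles: many-to-one
]
result = unique_name_matches(gns, articles)
print(result)
```

With heavy name reuse, a filter like this discards most candidates, which is why extra location data (province, county, gmina) would be needed to disambiguate the rest.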

GNS is generally rather bad at distinguishing a populated place from the administrative district of the same name, frequently coding them as a single entry. It's also often quite bad at coding subregions: subnational region fields are often missing (coded as "00"), or obsolete region data can be present. There are all sorts of other small glitches: for example, GNS has "Góra Święty Małgorzaty" where Wikipedia has Góra Świętej Małgorzaty. Is this a typo, or possibly a grammatical declension issue? If it's the latter, performing the appropriate matchups would require a knowledge of the Polish language. -- The Anome (talk) 21:01, 18 March 2009 (UTC)
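Two of the glitches above can at least be handled mechanically, as in the following Python sketch. It treats a subregion code of "00" (or empty) as missing, folds diacritics for comparison, and flags near-identical names for a human to look at rather than trying to decide whether they are typos or declensions. The function names and the 0.85 threshold are arbitrary choices made for illustration, not part of any existing tool.

```python
import difflib
import unicodedata


def adm1_or_none(code):
    """Treat an empty or "00" subregion code as missing."""
    return None if code in ("", "00") else code


def fold_diacritics(name):
    """Strip diacritics for comparison purposes only."""
    # The Polish l-with-stroke has no Unicode decomposition, so map it by hand.
    name = name.replace("ł", "l").replace("Ł", "L")
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def is_near_miss(name_a, name_b, threshold=0.85):
    """True if the names differ but are similar enough to need human review."""
    a, b = fold_diacritics(name_a), fold_diacritics(name_b)
    if a == b:
        return False
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold


print(fold_diacritics("Góra Świętej Małgorzaty"))
print(is_near_miss("Góra Święty Małgorzaty", "Góra Świętej Małgorzaty"))
```

A similarity check like this only produces a list of candidates; whether a given pair really is the same place still needs someone who knows Polish.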


 * Oh, as regards the list I said I could generate, it won't be a problem, but could you give me a week or two to finish off sorting out the last few sets of villages? Then I'll have a relatively complete list of villages for each gmina, and will be able to produce data about them in pretty much whatever form is going to be useful.--Kotniski (talk) 18:30, 18 March 2009 (UTC)


 * That's fine by me. I should also be in a better position to run my existing matching engine when you have finished populating the placename articles, which should at least catch those places in Poland with one-of-a-kind names. Can you let me know via my talk page when you are ready? -- The Anome (talk) 21:01, 18 March 2009 (UTC)

Thanks for the details; I'll have a look later. When I finish populating the articles (probably in about two weeks) I'll let you know.--Kotniski (talk) 13:37, 19 March 2009 (UTC)