User talk:Nanobear~enwiki/Regional stats DB

A comment
I just wanted to point out one thing&mdash;and it may seem trivial, but is not (and it will cost you points on a test if you are taking database theory class :)). The linking should never be done by the entity's name; i.e., if your database contains different stats by federal subjects, you need to have a separate table listing the federal subjects with a unique id assigned to each. That unique id field will be that table's primary key, and its values would be the values of the foreign keys. All in all, you should never ever link the data using fields which have meaning beyond serving as a unique identifier.

Other than that, it will only take mapping the ids in your database to the ids in my database to be able to connect the two datasets. Cheers,—Ëzhiki (Igels Hérissonovich Ïzhakoff-Amursky) • (yo?); April 25, 2011; 17:30 (UTC)


 * Well I did take a database theory class, but for some reason I wasn't taught one should never use attributes which have a real-world meaning as keys. (Perhaps because the examples in the textbook were simplified). I thought the name (e.g. "Kaluga") would be unique (only difficulty would be "Moscow" vs. "Moscow Oblast"), but I guess it's indeed better to create an artificial id for each region (like the integers {1,...,89}). Or would it be better to use ISO 3166-2:RU codes as primary keys, since they seem to be unique for each region? What do you think? Nanobear (talk) 17:48, 25 April 2011 (UTC)
 * Yes, for simplified textbook examples unique ids are often omitted. Doing so in the real world, however, means lots and lots of pain later on (trust me on that one :)). Changing the database structure is easiest when you are developing the said structure, not when you already have mounds of data loaded.
 * As for the ids, I actually use two unique identifier fields for the federal subjects&mdash;one with artificial ids (1 through 83 in the order they are listed in the Constitution, basically), and another one with the ISO codes (which are indeed unique). From the database theory point of view it's redundant, but in practice it is often rather convenient. For entities below the federal subject, I use artificial ids. For your purposes, you should choose whichever approach suites you better when you work on the scripts and process the data. As I said before, creating a lookup table with one field containing my ids and another one containing your ids would make it possible to interlink and query both databases no problem; regardless of what kind of ids we use. At any rate, I imagine most work on your side revolves around creating the scripts, not the database structure!—Ëzhiki (Igels Hérissonovich Ïzhakoff-Amursky) • (yo?); April 25, 2011; 18:26 (UTC)
 * By the way, here's a case in point why you shouldn't be using the names of federal subjects as primary keys. Right now your primary concern is with getting the data loaded in, and for that purpose it doesn't really matter what names you use for the federal subjects, as long as you can identify them unambiguously. But when the data are loaded, you will probably want to write a script which produces tables suitable for insertion into the articles. At that point, you'll probably find that you'll need to change the names of the federal subjects in the database (because, say, they are in Russian and you need them in English, or they are in English, but you need them to match the titles of Wikipedia articles, or you need both Russian and English names, or something else of that nature). If you have one table with the list of federal subjects and their ids, all you need to do is to change the names in that table. No other table will be affected, because the ids will remain the same. But with the existing approach, you'll need to change the names in every single table and to really watch for those typos in the process. Now, I assume that MySQL, like Access, has some referential integrity mechanism that automatically changes the linked data when the primary key is changed, but even so the chance of screwing something up along the way is substantially higher than when you are doing it right from the start. And once the relations between the tables become more complex, even referential integrity won't save the father of the Russian democracy :)—Ëzhiki (Igels Hérissonovich Ïzhakoff-Amursky) • (yo?); April 25, 2011; 18:56 (UTC)