User talk:Azylber/RaBOTnik/Task0/Question1

Comments
Thanks for compiling this list, Azylber! I can very well imagine that people put all kinds of crap into templates; I'd be surprised if lang-ru were an exception :)

Regarding catching and replacing Latin letters which look like Cyrillic ones, I think that would be an excellent side benefit. There is seldom a good reason to mix Latin and Cyrillic within one word. Even though it is occasionally done for visual effect (KOЯN and TETЯIS come to mind), doing so is most certainly undesirable when the letters look identical (such as "a" and "а" in your example). This, by the way, is indeed a more common problem than one would imagine (and very hard to catch manually, too), and I'm still figuring out just why the hell it is so common!

Regarding "incompatible stress marks", I'm sure some of those are plain mistakes, but there are quite a few homographs which differ only in stress. I've added some comments in your list where that's the case. I don't know if your script can be made sophisticated enough to distinguish a last name from a patronymic, but if it can be done, this should take care of a good chunk of those homographs. The patronymic versions are bound to be more common, whereas the differently stressed last names typically belong to non-Russian people of Slavic ethnicities.

Toponyms, on the other hand, are a whole different story. Beyond the most common ones (the likes of "Kamenka", "Ivanovka", "Krasny", etc.), it's just a land of wild guesses. No native Russian speaker would be able to tell you, for example, whether "Уда" is supposed to be "У́да" or "Уда́", unless they consulted a dictionary, lived in that place, or happened to hear it in a movie or something (and even then, this still does not account for possible homographs). Some human names share this problem as well, although it's not as common with them.

I hope all this doesn't dampen your spirits, though! An automated task like this is an enormous undertaking, and I'm afraid that so far you've only scratched the surface :) But in the end, even if your bot run has to be limited to the most common findings, I think overall it would still be a great improvement. I'm happy to help with what I can. Cheers,—Ëzhiki (Igels Hérissonovich Ïzhakoff-Amursky) • (yo?); September 25, 2013; 13:11 (UTC)
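
For illustration only, the mixed-script check discussed above could be sketched in Python roughly as follows. This is not RaBOTnik's actual code: the lookalike table and the function names are hypothetical, and the table is deliberately incomplete. A word is touched only when it already contains Cyrillic letters and every Latin letter in it has a visually identical Cyrillic counterpart, so deliberate mixes such as KOЯN are left alone.

```python
import re

# Hypothetical, incomplete table: Latin letter -> identical-looking Cyrillic letter.
# Escapes are used so that the two scripts cannot be confused in the source.
LATIN_TO_CYRILLIC = {
    "A": "\u0410", "B": "\u0412", "C": "\u0421", "E": "\u0415",
    "H": "\u041d", "K": "\u041a", "M": "\u041c", "O": "\u041e",
    "P": "\u0420", "T": "\u0422", "X": "\u0425",
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
}


def is_cyrillic(ch):
    """True for characters in the basic Cyrillic block (U+0400..U+04FF)."""
    return "\u0400" <= ch <= "\u04ff"


def is_latin(ch):
    """True for ASCII letters only; accented Latin is ignored in this sketch."""
    return ch.isascii() and ch.isalpha()


def fix_mixed_word(word):
    """Replace Latin lookalikes in a word that is otherwise Cyrillic.

    The word is returned unchanged if it has no Cyrillic letters, no Latin
    letters, or any Latin letter without an exact lookalike in the table.
    """
    latin = [ch for ch in word if is_latin(ch)]
    if not latin or not any(is_cyrillic(ch) for ch in word):
        return word
    if any(ch not in LATIN_TO_CYRILLIC for ch in latin):
        return word  # probably intentional (or needs a human), so skip it
    return "".join(LATIN_TO_CYRILLIC.get(ch, ch) for ch in word)


def fix_mixed_text(text):
    """Apply fix_mixed_word to every whitespace-separated token."""
    return re.sub(r"\S+", lambda m: fix_mixed_word(m.group(0)), text)
```

Tokens are split on whitespace rather than on word characters so that a combining stress mark (U+0301) inside a word does not cut the word in two.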


 * Thanks for all the feedback!
 * As regards the incompatible pairs, I see that there are some that you were not sure about and some you left blank, so I think we could have at most one or two more people look at the list. If you guys are still not sure after that, RaBOTnik will simply have to ignore those. I've also invited Ymblanter to take a look.
 * As regards the entries containing mixed characters, RaBOTnik will only automatically replace the ones that look exactly the same, such as that Барковa example.
 * As regards distinguishing a surname from a patronymic, some sophistication is easy to achieve. For example, if three words are supplied to lang-ru, the bot can try to evaluate the second one as a patronymic and the third one as a surname. This could potentially lead to some errors for toponyms that are made of exactly three words, but there aren't very many of them. Azylber (talk) 18:31, 25 September 2013 (UTC)
 * On the last one, you are right, there aren't that many toponyms which are made of three words, and of those there are hardly any which would look like human names (as for the two-word toponyms, Yerofey Pavlovich comes to mind).
 * As far as the sophistication goes, you might also want to include a comma check for those three words. It's not uncommon for the Russian editors to include the Russian spelling in the "LastName, FirstName Patronymic" format, since that's what the default is in the Russian Wikipedia. If there is no comma, then it's probably safe to assume that the format is "FirstName Patronymic LastName". Cheers,—Ëzhiki (Igels Hérissonovich Ïzhakoff-Amursky) • (yo?); September 25, 2013; 18:52 (UTC)
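
Putting the three-word rule and the comma check together, a rough Python sketch might look like the one below. Everything in it is an assumption made for illustration: the function names, the returned dictionary, and especially the suffix list, which is a crude heuristic and not a complete description of Russian patronymics. Anything that does not fit either format returns `None`, so the bot would simply skip it.

```python
STRESS_MARK = "\u0301"  # combining acute accent used for stress in lang-ru values

# Hypothetical heuristic: common patronymic endings (male "-ич", female "-вна", "-чна").
PATRONYMIC_SUFFIXES = ("ич", "вна", "чна")


def looks_like_patronymic(word):
    """Crude suffix test, done on the word with stress marks removed."""
    return word.replace(STRESS_MARK, "").lower().endswith(PATRONYMIC_SUFFIXES)


def guess_name_parts(value):
    """Guess first name, patronymic and last name from a lang-ru value.

    Handles "LastName, FirstName Patronymic" (with a comma) and
    "FirstName Patronymic LastName" (three words, no comma).
    Returns None when the value matches neither format.
    """
    if "," in value:
        last, _, rest = value.partition(",")
        last_words, rest_words = last.split(), rest.split()
        if len(last_words) != 1 or len(rest_words) != 2:
            return None
        first, patronymic = rest_words
        last = last_words[0]
    else:
        words = value.split()
        if len(words) != 3:
            return None
        first, patronymic, last = words

    # Only the middle position is tested, so a surname that happens to end
    # like a patronymic is still accepted when it stands in the surname slot.
    if not looks_like_patronymic(patronymic):
        return None
    return {"first": first, "patronymic": patronymic, "last": last}
```

A two-word value such as "Ерофей Павлович" falls through both branches and is left for a human to check, which seems the safer default for toponyms that look like names.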