Wikipedia:Reference desk/Archives/Language/2022 October 25

= October 25 =

Program for matching words across languages?
[Computational linguistics, NLP]

Suppose you have a database of sentences that have been translated into two languages and want to generate a dictionary between the two. Let's say the languages are English and Esperanto, and it turns out that most sentences in English that have the word "dog" include the word "hundo" or "hundon" in the Esperanto version, while those Esperanto words are uncommon otherwise. Of course, there could be other words that tend to co-occur with "dog", such as "kaniso[n]" (canid), "virhundo[n]" (male dog), and even spurious things like "boji" (to bark); and there could be other English words that co-occur with "hundo", like "hound", etc. It still seems like knowing the most correlated words would be a good starting point for further investigation. Is there an algorithm or (even better) pre-written program to do something like this? 98.170.164.88 (talk) 01:27, 25 October 2022 (UTC)


 * If there are already dictionaries for the languages in question, such as between English and Esperanto, this is not worth the effort. Otherwise, you need a really large corpus of pairs of sentences before you can expect meaningful results. This is unlikely to be available if one of the languages in question is so poorly documented that there are no dictionaries. Even with a large high-quality corpus available, the approach is fraught with problems. The definite article "the" is the most frequent word in English. The most frequent Russian word is "и". They do not match up: Russian has no articles, so there is no counterpart to English "the", and Russian "и" means "and", which in English ranks only fifth in the list of most common words. Another issue is that often what is one word in one language is a multiword term in the other language. German "Salpetersäureherstellung" is three words in English, "nitric acid production". Conversely, English "nosebleed" is three words in French, "saignement de nez".
 * Disregarding such issues, there is a relatively simple algorithm that, however, requires a considerable amount of memory. It goes as follows.
 * For each pair of sentences, produce a list of all pairs of words, and catenate all these lists.
 * Sort this long list of pairs, and count the number of occurrences of each pair.
 * Keep the most frequent pairs for further examination.
 * If the corpus holds $N$ sentence pairs and the typical number of words in a sentence is $L$, the length of the unsorted list of word pairs is on the order of $NL^{2}$. If $N = 10000$ and $L = 10$, that comes out to a list of a million pairs. For a more refined metric than mere frequency, one could use an information-based metric (in the sense of Shannon entropy). Let $U \sim V$ be a given word pair. Denote the information provided by "$U$ occurs in a (random) sentence" by $I(U)$, and the (conditional) information provided by "$U$ occurs in a (random) sentence given that $V$ occurs in the partner sentence" by $I(U \mid V)$. The difference $I(U) - I(U \mid V)$ indicates how strongly one can expect to find $U$ in the company of $V$. If this value is high, the association is strong. If it is high in both directions, the pair is a likely candidate for the future dictionary. --Lambiam 10:27, 25 October 2022 (UTC)
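 * A minimal sketch of the procedure described above, in Python, under several assumptions that are not in the original post: the tokenizer is a crude whitespace split, each word is counted at most once per sentence, a hash-based `Counter` stands in for the sort-and-count step, and the probabilities are plain relative frequencies over sentence pairs. With $I(x) = -\log_2 P(x)$, the difference $I(U) - I(U \mid V)$ equals $\log_2 \frac{P(U \mid V)}{P(U)}$, which is usually called pointwise mutual information. The names `tokenize`, `score_pairs`, `best_match` and `min_count`, as well as the four-sentence toy corpus, are made up for illustration.

```python
import math
from collections import Counter
from itertools import product


def tokenize(sentence):
    """Crude illustrative tokenizer: lowercase, split on whitespace,
    strip surrounding punctuation. Returns a set, so each word counts
    once per sentence."""
    words = {w.strip('.,;:!?"()').lower() for w in sentence.split()}
    return words - {""}


def score_pairs(corpus, min_count=2):
    """Count co-occurring word pairs and score them by I(U) - I(U|V).

    corpus: list of (source_sentence, target_sentence) tuples.
    min_count: pairs seen in fewer sentence pairs are not scored
               (an assumed noise filter, not part of the original post).
    """
    n = len(corpus)
    src_counts, tgt_counts, pair_counts = Counter(), Counter(), Counter()
    for src, tgt in corpus:
        s, t = tokenize(src), tokenize(tgt)
        src_counts.update(s)
        tgt_counts.update(t)
        # All word pairs for this sentence pair (steps 1 and 2 combined).
        pair_counts.update(product(s, t))

    scores = {}
    for (u, v), c in pair_counts.items():
        if c < min_count:
            continue
        # P(U|V) / P(U) = (c / count_v) / (count_u / n)
        scores[(u, v)] = math.log2(c * n / (src_counts[u] * tgt_counts[v]))
    return pair_counts, scores


def best_match(word, scores):
    """Return the target word with the highest score for `word`, if any."""
    candidates = {v: s for (u, v), s in scores.items() if u == word}
    return max(candidates, key=candidates.get) if candidates else None


# Toy corpus, invented purely for illustration.
corpus = [
    ("The dog barks.", "La hundo bojas."),
    ("The dog sleeps.", "La hundo dormas."),
    ("The cat sleeps.", "La kato dormas."),
    ("A cat eats.", "Kato manĝas."),
]

pair_counts, scores = score_pairs(corpus)
print(best_match("dog", scores))
```

 * On this toy corpus the pair ("the", "la") is the most frequent one, yet ("dog", "hundo") gets a higher score, which is the point of preferring the information-based metric over raw frequency. A real corpus would need a proper tokenizer and some handling of inflected forms such as "hundo"/"hundon".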


 * Tatoeba aims to collect such parallel corpora. According to the article, it has been used in statistical machine translation (Eric Nichols, Francis Bond, Darren Scott Appling and Yuji Matsumoto (2010) Paraphrasing Training Data for Statistical Machine Translation. Journal of Natural Language Processing, 17(3), pages 101–122.) This last article has an External links section. --Error (talk) 14:09, 25 October 2022 (UTC)
 * If I understand correctly, you're looking to machine-develop a dictionary of all sentences between languages based on words? I suppose that would be very difficult due to a) the wealth and breadth of word meanings: for instance, when you wrote 'dog' I almost instantly went to 'dog-eared', which has nothing to do with man's best friend (but much more with another man's best friend); b) possible modes of translation: for instance, a translator working with a text is at liberty to break down a long sentence into shorter ones if needed, which is true for, say, instruction manuals and liberal on-line or hobbyist work, but not really for literature; and c) errors. However, on a small scale, this is exactly what translation memories are for within the context of so-called CAT tools, albeit mostly for specific language pairs and texts or sets of texts (say, a company may maintain such translation memories for all its documentation to maintain consistency in terminology). Hope this helps somewhat. --Ouro (blah blah) 05:30, 29 October 2022 (UTC)
 * As Groucho Marx stated: "Outside a dog, a book is a man's best friend. Inside a dog, it's too dark to read." 惑乱 Wakuran (talk) 12:53, 29 October 2022 (UTC)
 * Love that quote, thanks! --Ouro (blah blah) 15:52, 29 October 2022 (UTC)

Nihon Arupusu
I'm quite surprised that the local Japanese name of the Japanese Alps is 日本アルプス (Nihon Arupusu), a partial loanword from English: Arupusu is a phonetic rendition of Alps. Apparently it was a name given to them by William Gowland. Am I to suppose that they didn't have a name before him? I find it difficult to believe that they didn't have some kind of traditional Japanese name, considering that historically Japan had periods characterized by intense nationalism, even isolationism (for example: Sakoku, WWII). 195.62.160.60 (talk) 09:55, 25 October 2022 (UTC)


 * These Japanese Alps consist of three mountain ranges, each of which has a Japanese name: 飛騨山脈 (Hida Mountains), 木曽山脈 (Kiso Mountains) and 赤石山脈 (Akaishi Mountains). Gowland introduced the term not so much as a name but as a descriptive term, using it merely for the first of these three ranges, the northernmost. Only later did people start to think of these ranges as a collective, and Gowland's term was repurposed as a name for this collective. --Lambiam 11:01, 25 October 2022 (UTC)
 * Judging from the Japanese Wikipedia article, it looks like an alternate name could be 中部山岳 (Chūbu-Sangaku), but I'm not able to find more information about this name. --79.13.167.156 (talk) 15:44, 25 October 2022 (UTC)
 * My rudimentary kanji knowledge, combined with snooping on Wiktionary, leads me to translate that as "Central Japan Mountain Peaks". 惑乱 Wakuran (talk) 22:36, 25 October 2022 (UTC)
 * 山岳 (sangaku) is a common term for "mountains" in the sense used in "the Appalachian mountains", i.e. a system of mountains. So we can also translate 中部山岳 as "the Chubu mountains", in which "Chubu" is short for "Chūbu region". The term may be descriptive rather than a proper noun. --Lambiam 21:37, 26 October 2022 (UTC)
 * According to the book Politics and Culture in Wartime Japan by Ben-Ami Shillony: When the island of Singapore fell in February 1942 its name was changed to Shōnan (shō being the first character of the era name Shōwa, and nan meaning South). Similarly, the name of the Japan Alps, adopted at the height of Westernization in the Meiji era, was changed into Chūbu sangaku (Central Mountains). --79.13.167.156 (talk) 08:19, 27 October 2022 (UTC)
 * Names change. The fact that the current name for a thing includes a loan word doesn't imply that it didn't use to have a name. Especially in Japan, which seems to delight in cultural borrowing. Isaac Rabinovitch (talk) 01:34, 26 October 2022 (UTC)
 * Although it's equally plausible that three separate mountain ranges didn't have a collective name until a foreign pedant decided that they needed one. Alansplodge (talk) 12:31, 26 October 2022 (UTC)
 * It was "foreign pedants", after all, who named the Rocky Mountains, which consist of many distinct mountain ranges extending for 3000 miles. I think the pedants were French fur traders. Cullen328 (talk) 06:43, 27 October 2022 (UTC)
 * Terms like "mountain range" are rather imprecise in most common speech; properly, the Rocky Mountains are a mountain system, though most people imprecisely use the word "range" to describe them. Broadly speaking, the Rockies are part of a larger grouping called the American Cordillera, which extends from the Antarctic Peninsula to the North Slope of Alaska. -- Jayron 32 15:33, 27 October 2022 (UTC)