User:AKA MBG/Todo

Test "Notes".

Help:
 * Help:Shortened footnotes

= Wiktionary =

Wiktionary data are heavily used in various natural language processing (NLP) tasks.

== Wiktionary data in NLP ==
Wiktionary contains semi-structured data. Its lexicographic data have to be converted to a machine-readable format before they can be used in natural language processing tasks.

Wiktionary data mining is a complex task, with three main difficulties: (1) the constant and frequent changes to both data and schema, (2) the heterogeneity of schemas across Wiktionary language editions, and (3) the human-centric nature of a wiki.
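As a minimal illustration of why such conversion is needed, the sketch below pulls part-of-speech sections and numbered definition lines out of a short, hypothetical English Wiktionary entry in wikitext. The entry layout and the regular expression are simplified assumptions; real entries differ between language editions and change over time, which is exactly what makes robust parsing hard.

```python
import re

# A shortened, hypothetical English-Wiktionary entry in wikitext.
WIKITEXT = """\
==English==

===Noun===
{{en-noun}}

# A domesticated feline.
# {{lb|en|slang}} A jazz musician.

===Verb===

# To hoist an anchor.
"""

def extract_sections(wikitext):
    """Map each part-of-speech heading to its numbered definition lines."""
    sections = {}
    current = None
    for line in wikitext.splitlines():
        heading = re.match(r"===\s*(.+?)\s*===$", line)
        if heading:
            current = heading.group(1)
            sections[current] = []
        elif current and line.startswith("#") and not line.startswith("#:"):
            # "#" marks a definition; "#:" would mark an example sentence.
            sections[current].append(line.lstrip("# ").strip())
    return sections

print(extract_sections(WIKITEXT))
```

Even this toy extractor hard-codes one edition's heading depth and definition markup; a second language edition would need its own rules, which is difficulty (2) above.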

There are several parsers for different Wiktionary language editions:
 * DBpedia Wiktionary — a subproject of DBpedia; the data are extracted from the English, French, German and Russian Wiktionaries and include language, part of speech, definitions, semantic relations and translations. Information is extracted using a declarative description of the page schema, regular expressions and a finite-state transducer.
 * JWKTL (Java Wiktionary Library) — provides access to English Wiktionary and German Wiktionary dumps via a Java API. The data include language, part of speech, definitions, quotations, semantic relations, etymologies and translations. JWKTL is available for non-commercial use.
 * wikokit — a parser for the English and Russian Wiktionaries. The parsed data include language, part of speech, definitions, quotations, semantic relations and translations. It is multi-licensed open-source software.
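To make "machine-readable format" concrete, here is a hypothetical record type covering the fields that the parsers above extract in one form or another; the actual JWKTL, wikokit and DBpedia data models differ from this sketch.

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical entry schema; field names are illustrative, not any parser's API.
@dataclass
class LexicalEntry:
    word: str
    language: str
    pos: str
    definitions: list = field(default_factory=list)
    quotations: list = field(default_factory=list)
    relations: dict = field(default_factory=dict)     # e.g. {"hypernym": [...]}
    translations: dict = field(default_factory=dict)  # e.g. {"fr": [...]}

entry = LexicalEntry(
    word="cat",
    language="English",
    pos="noun",
    definitions=["A domesticated feline."],
    relations={"hypernym": ["animal"]},
    translations={"fr": ["chat"], "ru": ["кошка"]},
)

# Serialize to JSON so downstream NLP tools can consume the entry.
print(json.dumps(asdict(entry), ensure_ascii=False, indent=2))
```

Once entries are in such a flat, typed form, the tasks listed below (translation lookup, feature extraction, tag dictionaries) reduce to ordinary data processing.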

Various natural language processing tasks have been solved with the help of Wiktionary data:
 * Rule-based machine translation between Dutch and Afrikaans; data from the English Wiktionary, the Dutch Wiktionary and Wikipedia were used with the Apertium machine translation platform.
 * Construction of a machine-readable dictionary by the NULEX parser, which integrates open linguistic resources: the English Wiktionary, WordNet and VerbNet. NULEX scrapes the English Wiktionary for tense information (verbs) and for plural forms and parts of speech (nouns).
 * Speech recognition and synthesis, where Wiktionary was used to automatically create pronunciation dictionaries. Word-pronunciation pairs were retrieved from six Wiktionary language editions (Czech, English, French, Spanish, Polish, and German), with pronunciations given in the International Phonetic Alphabet. The ASR system based on the English Wiktionary had the highest word error rate: roughly every third phoneme had to be changed.
 * Ontology engineering and semantic network construction.
 * Ontology matching.
 * Text simplification. Medero & Ostendorf assessed vocabulary difficulty (reading level detection) with the help of Wiktionary data. They investigated properties of words extracted from Wiktionary entries: definition length and the numbers of parts of speech, senses, and translations. They expected that very common words would be more likely to (1) have multiple parts of speech, (2) have multiple senses, and (3) have been translated into multiple languages. These features proved useful in distinguishing word types that appear in Simple English Wikipedia articles from words that appear only in the comparable Standard English articles.
 * Part-of-speech tagging. Li et al. (2012) built multilingual POS-taggers for eight resource-poor languages on the basis of English Wiktionary and Hidden Markov Models.
 * Sentiment analysis.
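The word-difficulty features from the text simplification work above can be sketched directly. The function below computes them from a hypothetical pre-parsed entry (a dict of part of speech to sense glosses, plus a list of translation languages); the input layout and feature names are assumptions for illustration, not Medero & Ostendorf's actual implementation.

```python
# Reading-level features from a hypothetical pre-parsed Wiktionary entry.
def difficulty_features(entry):
    """entry: {"definitions": {pos: [gloss, ...]}, "translations": [lang, ...]}"""
    senses = [g for glosses in entry["definitions"].values() for g in glosses]
    return {
        "num_pos": len(entry["definitions"]),                 # common words: more POS
        "num_senses": len(senses),                            # common words: more senses
        "num_translations": len(set(entry["translations"])),  # translated more widely
        "avg_def_length": (sum(len(g.split()) for g in senses) / len(senses)
                           if senses else 0.0),
    }

cat = {
    "definitions": {"noun": ["A domesticated feline.", "A jazz musician."],
                    "verb": ["To hoist an anchor."]},
    "translations": ["fr", "de", "ru", "pl"],
}
print(difficulty_features(cat))
```

Such a feature vector can then feed any standard classifier to separate "simple" from "standard" vocabulary.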