User:Dengyuliang/Cross-language information retrieval

Cross-language information retrieval (CLIR) is a subfield of information retrieval that deals with retrieving information written in a language different from the language of the user's query. The term "cross-language information retrieval" has many synonyms, of which the following are perhaps the most frequent: cross-lingual information retrieval, translingual information retrieval, and multilingual information retrieval. The term "multilingual information retrieval" refers more generally both to technology for retrieval from multilingual collections and to technology that has been adapted from handling material in one language to handling material in another. Multilingual Information Retrieval (MLIR) involves the study of systems that accept queries in various languages and return objects (text and other media) in various languages, translated into the user's language. Cross-language information retrieval refers more specifically to the use case where users formulate their information need in one language and the system retrieves relevant documents in another. Cross-lingual information retrieval needs to solve the following main problems:

 * Correspondence between languages
 * Word ambiguity and polysemy
 * Word segmentation of the query
 * Multilingualism of the target documents
 * Ranking of the output results

CLIR approaches differ in how queries are posed and processed, but in each case this processing takes place before the query is run against the IR system. According to the taxonomy developed by Oard & Dorr, there are three main approaches to CLIR:

 * Machine Translation-based approaches.
 * Thesaurus-based approaches.
 * Corpus-based approaches.

The machine translation (MT) approach has two steps: (1) search query translation and (2) target document translation. In the first step, the query is automatically translated from the source language into the target language. In the second step, the target documents are translated offline before any search takes place, and searches are then run against the cached translations. The advantage of the MT approach is that it is simple and direct and usually offers fast retrieval. Its disadvantages are that (1) short queries are relatively easy to mistranslate, especially when they contain specific nouns, (2) offline translation can be very time-consuming and consume a lot of storage space, and (3) it inherits most of the weaknesses of the underlying MT system.
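The query translation step can be illustrated with a minimal Python sketch. It is not a real MT system: the word-by-word lexicon `TOY_LEXICON`, the functions `translate_query` and `search`, and the sample German documents are all hypothetical stand-ins chosen for illustration, and the offline document translation step is not shown.

```python
# Toy stand-in for an MT system: a hypothetical English-to-German word lexicon.
TOY_LEXICON = {"cheap": "billig", "flights": "flüge", "hotel": "hotel"}


def translate_query(query, lexicon=TOY_LEXICON):
    """Translate the query word by word; unknown words pass through unchanged."""
    return [lexicon.get(word, word) for word in query.lower().split()]


def search(translated_terms, documents):
    """Rank documents by how many translated query terms they contain."""
    scored = []
    for doc_id, text in documents.items():
        tokens = set(text.lower().split())
        score = sum(1 for term in translated_terms if term in tokens)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties are broken by document id for determinism.
    return [doc_id for _, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical target-language (German) collection.
german_docs = {
    "d1": "billig flüge nach berlin",
    "d2": "hotel in berlin",
    "d3": "wetter in hamburg",
}

results = search(translate_query("cheap flights"), german_docs)
```

Because unknown words are passed through untranslated, a short query with a single out-of-lexicon term retrieves nothing, which mirrors the weakness on short queries noted above.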

In the thesaurus-based approach, a thesaurus is a resource that organizes the terminology of a domain of knowledge, and a multilingual thesaurus is used to carry out CLIR. Each term in the thesaurus uniquely specifies a concept, and the target documents are marked with concepts from the thesaurus before queries are executed. A query can be matched directly or executed as concept retrieval. The advantages of the thesaurus-based approach are high efficiency and a clear, unambiguous mapping. Its disadvantages are that the thesaurus is expensive, the target documents need to be marked, scalability is insufficient, and retrieval is limited to the specific field of the predefined thesaurus.
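Concept retrieval over a multilingual thesaurus might be sketched as follows. The thesaurus entries, concept identifiers, document markings, and function names are invented for illustration; a real thesaurus would be far larger and would also encode relations between concepts.

```python
# Hypothetical multilingual thesaurus: each concept id lists its terms per language.
THESAURUS = {
    "C001": {"en": ["heart attack", "myocardial infarction"], "de": ["herzinfarkt"]},
    "C002": {"en": ["influenza", "flu"], "de": ["grippe"]},
}

# Documents are assumed to have been marked with concept ids beforehand.
DOC_CONCEPTS = {
    "doc_de_1": {"C001"},
    "doc_de_2": {"C002"},
    "doc_en_1": {"C001", "C002"},
}


def concepts_for_query(query, lang):
    """Map a query to the concept ids whose terms in `lang` occur in it."""
    # Pad with spaces so that terms only match on whole-word boundaries.
    padded = " " + " ".join(query.lower().split()) + " "
    return {
        concept_id
        for concept_id, terms in THESAURUS.items()
        if any(f" {term} " in padded for term in terms.get(lang, []))
    }


def concept_retrieval(query, lang):
    """Return ids of documents marked with any concept found in the query."""
    wanted = concepts_for_query(query, lang)
    return sorted(doc for doc, concepts in DOC_CONCEPTS.items() if wanted & concepts)
```

Since matching happens on concept identifiers rather than on words, a query in either language retrieves documents in both, but only for concepts the thesaurus already contains.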

The corpus-based approach uses statistical information from parallel corpora to process queries. It is usually based on two retrieval principles: target documents that use the query terms frequently are more relevant than those that use them infrequently, and rare query terms are more useful than common query terms. The advantage of using parallel corpora is that they can include up-to-date terminology, but the approach is also limited by the domains of the corpora.
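The two retrieval principles correspond closely to term frequency and inverse document frequency weighting. The sketch below uses a plain TF-IDF score as one possible illustration of them; the function name `tf_idf_scores` and the toy collection are assumptions, and the step of deriving translations from a parallel corpus is left out.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, documents):
    """Score each document by the sum of tf * idf over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if df[term] == 0:
                continue  # term absent from the collection
            # Rare terms get a larger weight than common ones.
            idf = math.log(n_docs / df[term])
            # Frequent use of the term in this document raises the score.
            score += tf[term] * idf
        scores[doc_id] = score
    return scores


# Hypothetical toy collection.
docs = {"a": "bank bank loan", "b": "bank river", "c": "river water"}
scores = tf_idf_scores(["bank", "loan"], docs)
```

Here document `a` scores highest because it uses "bank" twice and also contains the rarer term "loan", while document `c` contains neither query term and scores zero.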

CLIR systems have improved so much that the most accurate multilingual and cross-lingual ad hoc information retrieval systems today are nearly as effective as monolingual systems. For languages with a small number of users, they are even more accurate than monolingual systems. Other related information access tasks, such as media monitoring, information filtering and routing, sentiment analysis, and information extraction, require more sophisticated models and typically more processing and analysis of the information items of interest. Much of that processing needs to be aware of the specifics of the target languages in which it is deployed.

Mostly, the various mechanisms of variation in human language pose coverage challenges for information retrieval systems: texts in a collection may treat a topic of interest but use terms or expressions that do not match the expression of the information need given by the user. This can be true even in the monolingual case, but it is especially true in cross-lingual information retrieval, where users may know the target language only to some extent. The benefits of CLIR technology for users with poor to moderate competence in the target language have been found to be greater than for those who are fluent. Specific technologies in place for CLIR services include morphological analysis to handle inflection, decompounding or compound splitting to handle compound terms, and translation mechanisms to translate a query from one language to another.
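Of the technologies listed, compound splitting is the easiest to show in a few lines. The following is a minimal sketch that assumes a small hypothetical lexicon of known word parts and tries longer parts first; it ignores linking elements between parts and is not how any particular CLIR system implements decompounding.

```python
def split_compound(word, lexicon, min_part=3):
    """Split `word` into lexicon parts, or return [word] if no full split exists."""
    word = word.lower()

    def helper(rest):
        if not rest:
            return []  # the whole word has been covered
        # Try longer prefixes first, backtracking if the remainder cannot be split.
        for end in range(len(rest), min_part - 1, -1):
            prefix = rest[:end]
            if prefix in lexicon:
                tail = helper(rest[end:])
                if tail is not None:
                    return [prefix] + tail
        return None  # no split of `rest` into lexicon parts

    parts = helper(word)
    return parts if parts else [word]


# Hypothetical lexicon of German word parts.
parts_lexicon = {"kranken", "haus", "versicherung"}
```

For example, `split_compound("Krankenversicherung", parts_lexicon)` yields the parts `kranken` and `versicherung`, each of which can then be translated or matched separately.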

The first workshop on CLIR was held in Zürich during the SIGIR-96 conference. Workshops have been held yearly since 2000 at the meetings of the Cross Language Evaluation Forum (CLEF). Researchers also convene at the annual Text Retrieval Conference (TREC) to discuss their findings regarding different systems and methods of information retrieval, and the conference has served as a point of reference for the CLIR subfield.

Google Search had a cross-language search feature that was removed in 2013.