User:Tresoldi/NLP

Here is a list of resources for my ongoing project on statistical translation of Wikipedia:

Corpora
Wikipedia dumps - The main source for my own corpora. Free download, licensed under CC-BY-SA 3.0.

Europarl - Collected for the Moses project, it is a parallel corpus described as "European Parliament Proceedings Parallel Corpus 1996-2009". Available in Danish, German, English (UK), Spanish, Finnish, French, Italian, Dutch, Portuguese (PT), and Swedish. Free download, supposedly free even for commercial use.

OPUS - An open-source parallel corpus; a very good initiative. Free.

WaCky - Acronym for Web-As-Corpus Kool Yinitiative. They provide corpora collected from the web for Italian, German and English, tagged and lemmatized. Authorization must be requested before download. No information on license.

TIGERCorpus - The TIGER Treebank (Version 2.1) consists of approximately 900,000 tokens (50,000 sentences) of German newspaper text, taken from the Frankfurter Rundschau. The corpus was semi-automatically POS-tagged and annotated with syntactic structure. Moreover, it contains morphological and lemma information for terminal nodes. Free for non-commercial use; commercial use upon request.

BNC-British National Corpus - The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English. Commercial, but can be queried.

American National Corpus - Equivalent of the BNC? A sample is free to download; no license information is given.