English-Arabic Parallel Corpus of United Nations Texts

The English-Arabic Parallel Corpus of United Nations Texts (EAPCOUNT) is one of the largest available parallel corpora involving the Arabic language. It is intended as a general research tool, available beyond the original project for applied and theoretical linguistic research. It was started in 2006 as a PhD research project at the Department of Linguistics, University of Carthage, by Dr. Hammouda Salhi (حَمُّودَة الصّالحي), in collaboration with some of his students, and was completed in 2010. The full description of the corpus was completed in 2009 and revised in 2010.

The EAPCOUNT project is a response to the unsatisfactory performance of general-purpose dictionaries (Zanettin, 2009), especially in translation studies and comparative research involving Arabic. It was also motivated by the increasing demand for cross-lingual research and information retrieval (Salhi, 2010).

The EAPCOUNT comprises 341 texts aligned at the paragraph level, that is, English texts together with their Arabic translations. It consists of two subcorpora: one contains the English originals and the other their Arabic translations. The English subcorpus contains 3,794,677 word tokens and 78,606 word types. The Arabic subcorpus has slightly fewer word tokens (3,755,741) but far more word types (143,727). The whole corpus therefore contains 7,550,418 tokens.
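The figures above can be cross-checked with a few lines of Python. The following is a minimal sketch, not part of the EAPCOUNT project itself: the `Subcorpus` class and the type-token ratio (types divided by tokens) are illustrative assumptions introduced here to make the contrast between the two subcorpora concrete. Only the token and type counts come from the corpus description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subcorpus:
    """Hypothetical container for the counts reported in the text."""
    name: str
    tokens: int  # running words
    types: int   # distinct word forms

    @property
    def type_token_ratio(self) -> float:
        # Types divided by tokens: a rough indicator of lexical variety.
        return self.types / self.tokens


# Counts as reported for the two EAPCOUNT subcorpora.
english = Subcorpus("English", tokens=3_794_677, types=78_606)
arabic = Subcorpus("Arabic", tokens=3_755_741, types=143_727)

# Size of the whole corpus in tokens.
total_tokens = english.tokens + arabic.tokens

# How many Arabic word types there are per English word type.
type_ratio = arabic.types / english.types

for sub in (english, arabic):
    print(f"{sub.name}: {sub.tokens:,} tokens, {sub.types:,} types, "
          f"TTR = {sub.type_token_ratio:.4f}")
print(f"Total tokens: {total_tokens:,}")
print(f"Arabic/English type ratio: {type_ratio:.2f}")
```

The sum of the two token counts matches the reported corpus total, and the Arabic subcorpus shows nearly twice as many word types as the English one despite having slightly fewer tokens.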

Texts included in the EAPCOUNT
The EAPCOUNT consists mainly, but not exclusively, of resolutions and annual reports issued by different UN organizations and institutions. Some texts are taken from the authoritative publications of another UN-like institution, the Inter-Parliamentary Union (IPU); these represent 2.18% of the total number of tokens in the English subcorpus. Most texts, however, were issued by the General Assembly and the Security Council (66.44% of source-language (SL) tokens). The assumption is that the target-language (TL) texts produced by these international bodies can be regarded as highly reliable translations. All texts were downloaded from first-hand sources (the official websites of these agencies) to ensure that the publications are kept in their original form.

Time-frame
The EAPCOUNT texts cover a time-frame of about 14 years. The corpus can nevertheless be treated as synchronic, even though Meyer (2002:46) maintains that “a time-frame of 5 to 10 years seems reasonable” for a synchronic corpus. This is because almost all original texts and translations are issued by the same bodies and are governed by strict norms and standards of writing and translation, which arguably means that language change happens at a slower pace. In addition, 22.6% of the texts were produced in 2009, 16% in 2007, and 13.4% in 2005, and 93.87% of the texts were produced over a period of 9 years, from 2001 to 2009, that is, within the time-frame Meyer considers reasonable for a synchronic corpus.

Main sources of EAPCOUNT texts

 * General Assembly Resolutions: http://www.un.org/ga/64/resolutions.shtml
 * Security Council Resolutions: http://www.un.org/Docs/sc/unsc_resolutions.html
 * UNICEF Publications: http://www.unicef.org/publications/index.html
 * International Monetary Fund Publications: http://www.imf.org/external/arabic/index.htm