User:KalenTheGreat/sandbox

International Corpus of English Improvement Notes
History

Contribution

LesleyMich1 (talk) 21:16, 22 March 2018 (UTC) I want to add how the idea started. The history could use a few more sentences about whose idea it was and who made it happen. I do this at the beginning of the history section.

History

LesleyMich1 (talk) 21:16, 22 March 2018 (UTC)

Sidney Greenbaum’s goal of compiling corpora for comparing the syntax of English worldwide became the ICE project, which was achieved by Professor Charles F. Meyer. Researchers use the corpora to compare the syntax of the varieties of English. The project began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Twenty-three research teams around the world are preparing electronic corpora of their own national or regional variety of English. Each ICE corpus consists of one million words of spoken and written English produced after 1989. For most participating countries, the ICE project is stimulating the first systematic investigation of the national variety. To ensure compatibility among the component corpora, each team is following a common corpus design, as well as a common scheme for grammatical annotation.

Initial Comments
There is a lot that can be added to this article. The International Corpus of English is widely used and cited, and certainly has a history longer than a single paragraph. One initial idea for improving it is to find influential linguistics papers that used ICE as a corpus. We can also dig deeper into the structure and make-up of the corpus and elaborate on that. We can probably also elaborate on the different countries that participate.

What I plan to contribute
I plan on adding most to the description of the corpus itself. There is a lot of information about the internal structure of a corpus that readers may want to know, and it's certainly an important topic. I will also be researching popular academic papers and experiments that used the corpus for their research.

Description (my section)
Each corpus contains one million words in 500 texts of 2000 words, following the sampling methodology used for the Brown Corpus. Unlike Brown or the Lancaster-Oslo-Bergen (LOB) Corpus (or indeed mega-corpora such as the British National Corpus), however, the majority of texts are derived from spoken data.

With only one million words each, ICE corpora are considered very small by modern standards. Each ICE corpus contains 60% (600,000 words) orthographically transcribed spoken English. The father of the project, Sidney Greenbaum, insisted on the primacy of the spoken word, following Randolph Quirk and Jan Svartvik's collaboration on the original London-Lund Corpus (LLC). This emphasis on word-for-word transcription sets ICE apart from many other corpora, including those containing, e.g., parliamentary or legal paraphrases.
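The figures above can be checked with a little arithmetic. The following is a minimal sketch, not part of any ICE tool: the constant and function names are hypothetical, and it assumes every text is exactly 2,000 words, which is an idealization of the design described here.

```python
# Design figures taken from the description above.
TEXTS_PER_CORPUS = 500
WORDS_PER_TEXT = 2000
SPOKEN_PERCENT = 60  # share of each corpus that is transcribed speech


def corpus_breakdown(texts=TEXTS_PER_CORPUS,
                     words_per_text=WORDS_PER_TEXT,
                     spoken_percent=SPOKEN_PERCENT):
    """Return word and text counts for one idealized ICE corpus."""
    total_words = texts * words_per_text
    spoken_words = total_words * spoken_percent // 100
    written_words = total_words - spoken_words
    return {
        "total_words": total_words,
        "spoken_words": spoken_words,
        "written_words": written_words,
        # Text counts follow from dividing by the fixed text length.
        "spoken_texts": spoken_words // words_per_text,
        "written_texts": written_words // words_per_text,
    }


if __name__ == "__main__":
    for key, value in corpus_breakdown().items():
        print(f"{key}: {value:,}")
```

Under these assumptions, the 60% spoken share corresponds to 300 of the 500 texts, leaving 200 written texts (400,000 words).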

The corpora consist entirely of data from 1990 or later. The subjects from whom the data was collected are all adults who were educated in English and were either born in, or moved at an early age to, the country to which their data is attributed. There are speech and text samples from both men and women of many age groups, but the corpus website makes a point of noting that, "The proportions, however, are not representative of the proportions in the population as a whole: women are not equally represented in professions such as politics and law, and so do not produce equal amounts of discourse in these fields."

The British component of ICE, ICE-GB, is fully parsed with a detailed Quirk et al. phrase structure grammar, and the analyses have been thoroughly checked and completed. This analysis includes part-of-speech tagging and parsing of the entire corpus. The treebank can be searched and explored in detail with the ICE Corpus Utility Program (ICECUP) software. More information is in the handbook.

To ensure compatibility between the individual corpora in ICE, each team is following a common corpus design, as well as a common scheme for grammatical annotation. Many corpora are currently available for download on the ICE official webpage, though some require a license. Others, however, are not ready for publication.

Textual and Grammatical Annotation
Researchers and linguists follow specific guidelines when annotating data for the corpus. The three levels of annotation are textual markup, word class tagging, and syntactic parsing.

Textual Markup
Original markup and layout, such as sentence and paragraph divisions, are preserved, with special markers indicating that they are original. Spoken data is transcribed orthographically, with indicators for hesitations, false starts, and pauses.

Greenbaum, S. (1996). Comparing English Worldwide: The International Corpus of English. New York; Oxford: Clarendon Press.

Word Class Tagging
Word classes, also called parts of speech, are grammatical categories for words based on their function in a sentence.

British texts are automatically tagged for wordclass by the ICE tagger, developed at University College London, which uses a comprehensive grammar of the English language.

All other components are tagged automatically using the Penn Treebank and CLAWS tagsets. While the tags are not corrected manually, they are checked regularly for quality.

Syntactic Parsing
Sentences are parsed automatically and, if necessary, corrected manually with ICECUP, a syntax tree editor created specifically for the corpus.

Dependency parsing is also done automatically, using the Pro3Gres dependency parser. The results are not manually verified.

Design of the Corpora
Below are the subsections of the ICE, with the number of corpora for each category and sub-category in parentheses.

Participants
The current participating countries and regions are (* = available):
 * Australia
 * Cameroon
 * Canada*
 * East Africa (Kenya, Malawi, Tanzania)*
 * Fiji
 * Ghana
 * Great Britain* (parsed)
 * Hong Kong*
 * India*
 * Ireland*
 * Jamaica*
 * Kenya
 * Malta
 * Malaysia
 * New Zealand*
 * Nigeria* (tagged)
 * Pakistan
 * Philippines*
 * Sierra Leone
 * Singapore*
 * South Africa
 * Sri Lanka
 * Trinidad and Tobago
 * USA
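
The notation used in the list above (a trailing asterisk for an available corpus, and a parenthesised "parsed" or "tagged" label) can be read mechanically. The sketch below is only an illustration of that convention; the function name and the dictionary keys are hypothetical and do not come from the ICE project.

```python
def parse_participant(entry):
    """Split a list entry such as 'Great Britain* (parsed)' into its parts."""
    text = entry.strip()
    annotation = None
    # A trailing "(parsed)" or "(tagged)" marks the annotation status.
    for label in ("parsed", "tagged"):
        suffix = f"({label})"
        if text.endswith(suffix):
            annotation = label
            text = text[: -len(suffix)].rstrip()
    # A trailing asterisk marks a corpus that is available.
    available = text.endswith("*")
    name = text.rstrip("*").rstrip()
    return {"name": name, "available": available, "annotation": annotation}


if __name__ == "__main__":
    for item in ("Great Britain* (parsed)", "Nigeria* (tagged)", "Fiji"):
        print(parse_participant(item))
```

Other parenthesised text, such as the country list in the East Africa entry, is kept as part of the name.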

Greenbaum, S. (1996). Comparing English Worldwide: The International Corpus of English. New York; Oxford: Clarendon Press.

Preliminary Bibliography
In addition to the websites below, we can also find the individual websites for each country that participates in the program.

http://www.ucl.ac.uk/english-usage/projects/ice.htm

http://ice-corpora.net/ice/

http://onlinelibrary.wiley.com/doi/10.1111/j.1467-971X.1996.tb00088.x/abstract

http://www.helsinki.fi/varieng/CoRD/corpora/ICE-GB/

http://www.jbe-platform.com/content/journals/10.1075/ijcl.7.2.02bol

http://onlinelibrary.wiley.com/doi/10.1111/j.1467-971X.1988.tb00241.x/full

https://books.google.com/books?hl=en&lr=&id=iljWOhT2n-8C&oi=fnd&pg=PR11&dq=International+corpus+of+English&ots=JJLa8l_Jn6&sig=pIYZTukj1LC1P4HFHBuMUX4BPSI#v=onepage&q=International%20corpus%20of%20English&f=false