User:Htyue1/sandbox

The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late twentieth century from a wide variety of genres with the intention that it be a representative sample of spoken and written British English of that time.

History
The creation of the BNC started in 1991 under the management of the BNC consortium and the project was finished by 1994. There have been no additions of new samples after 1994 but the BNC underwent slight revisions before the release of the second edition BNC World (2001) and the third edition BNC XML Edition (2007).

Background
The BNC was the vision of computational linguists whose goal was a corpus (a collection of texts) of naturally occurring language, spoken and written, that was modern at the time of compilation and could be analyzed by computer. It was therefore compiled as a general, machine-readable corpus to pave the way for automatic search and processing in the field of corpus linguistics. One way in which the BNC was to differ from existing corpora of the time was that its data would be open not only to academic research but also to commercial and educational uses.

The corpus was restricted to British English and was not extended to cover World Englishes, partly because a significant portion of the project's cost was funded by the British government, which was naturally interested in supporting documentation of its own linguistic variety.

Because of its unprecedented size, the BNC also required funds from commercial and academic institutions. In turn, BNC data became available for commercial and academic research.

Description
The BNC is a monolingual corpus, as it records samples of language use in British English only, although words and phrases from other languages are occasionally present. It is also a synchronic corpus, as only language use from the late twentieth century is represented; the BNC is not meant to be a historical record of the development of British English over the ages. From the beginning, those involved in gathering written data sought to make the BNC a balanced corpus and hence looked for data across a variety of written media.

Written Corpus
Samples of written language use make up 90% of the BNC. These samples were extracted from regional and national newspapers, published research journals and periodicals from various academic fields, fiction and non-fiction books, and other published and unpublished material such as leaflets, brochures, letters, essays written by students of differing academic levels, speeches, scripts and many other types of text.

Spoken Corpus
Samples of spoken language use make up the remaining 10% of the BNC. These are presented and recorded in the form of orthographic transcriptions. The spoken corpus consists of two parts. One part is demographic, containing transcriptions of spontaneous natural conversations produced by volunteers of various age groups and social classes and from different regions. These conversations were produced in different situations, ranging from formal business or government meetings to conversations on radio shows and phone-ins.

The other part involves context-governed samples such as transcriptions of recordings made at specific types of meeting and event. All the original recordings transcribed for inclusion in the BNC have been deposited at the British Library Sound Archive.

Technical information
The corpus is marked up following the recommendations of the Text Encoding Initiative and includes full linguistic annotation and contextual information. The most recent edition, from March 2007, is distributed in XML format along with the Xaira software. It is freely available under a licence and is widely distributed.
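As a rough illustration of what working with the XML edition can look like, the sketch below reads part-of-speech annotation from a single marked-up sentence. It assumes the word-level markup described in the BNC XML Edition documentation, in which `<w>` elements carry `c5` (CLAWS C5 tag), `hw` (headword) and `pos` (simplified word class) attributes. The sample sentence is invented for this example and is not taken from the corpus, and the function name `tagged_words` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample in the style of BNC XML word-level markup (assumption:
# <s> wraps a sentence, <w> a word, <c> a punctuation mark).
SAMPLE = (
    '<s n="1">'
    '<w c5="AT0" hw="the" pos="ART">The </w>'
    '<w c5="NN1" hw="corpus" pos="SUBST">corpus </w>'
    '<w c5="VBZ" hw="be" pos="VERB">is </w>'
    '<w c5="AJ0" hw="large" pos="ADJ">large</w>'
    '<c c5="PUN">.</c>'
    '</s>'
)


def tagged_words(sentence_xml):
    """Return (word form, C5 tag, headword) for every <w> element."""
    root = ET.fromstring(sentence_xml)
    return [
        ((w.text or "").strip(), w.get("c5"), w.get("hw"))
        for w in root.iter("w")
    ]


for form, c5, headword in tagged_words(SAMPLE):
    print(f"{form}\t{c5}\t{headword}")
```

The same loop could be applied to a whole corpus file by parsing it with `ET.parse` instead of `ET.fromstring`, provided the file follows the markup assumed here.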

Tagging
The BNC has been tagged for grammatical information (part of speech). The tagging system, CLAWS, went through several rounds of improvement, culminating in CLAWS4, which was used to tag the BNC. CLAWS1 was based on a Hidden Markov Model (HMM) and, when used for automatic tagging, successfully tagged 96% to 97% of each text analyzed. CLAWS2 removed the need for manual processing of texts before automatic tagging. CLAWS4 added improvements such as more powerful word-sense disambiguation (WSD) and the ability to deal with variation in orthography and markup language. Later work on the tagging system aimed to raise the success rate of automatic tagging and to reduce the manual processing required, while maintaining effectiveness and efficiency, by introducing software to do some of the manual work. A new program, the Template Tagger, was subsequently introduced to serve a corrective function. Tags indicating ambiguity were later added. Some manual tagging is still required because CLAWS4 is unable to deal with foreign words.

A licence to use the CLAWS4 part-of-speech tagger may be purchased. Alternatively, a tagging service is offered by Lancaster University.
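To give a sense of how an HMM-based tagger chooses among competing tags, here is a minimal sketch of Viterbi decoding over a toy model. It is not the CLAWS implementation: the tag set (`DET`, `NOUN`, `VERB`), all probabilities, and the function names `viterbi` and `tag` are hypothetical values chosen only to show the general technique.

```python
# Toy HMM; every tag and probability below is invented for illustration.
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.4, "walks": 0.1, "park": 0.4},
    "VERB": {"walks": 0.6, "dog": 0.05},
}


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for words under the HMM."""
    # Each column maps tag -> (probability of best path ending here, previous tag).
    best = [{
        t: (start_p.get(t, floor) * emit_p[t].get(words[0], floor), None)
        for t in tags
    }]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p].get(t, floor)
                 * emit_p[t].get(word, floor), p)
                for p in tags
            )
        best.append(column)
    # Backtrack from the most probable final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))


def tag(words):
    """Tag a list of lower-cased words with the toy model."""
    return list(zip(words, viterbi(words, TAGS, START, TRANS, EMIT)))


print(tag(["the", "dog", "walks"]))
```

In this toy model "walks" can be emitted by either `NOUN` or `VERB`; the decoder picks `VERB` because the transition from `NOUN` to `VERB` makes that path more probable overall. Unseen words receive a small floor probability, which is one simple assumption among several possible smoothing choices.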

Search Tools
SARA SCoPE

Permission Issue
The BNC was the first text corpus of its size to be made widely available. This can be attributed to the standard forms of agreement between rights owners and the Consortium on the one hand, and between corpus users and the Consortium on the other. Owners of intellectual property rights (IPR) were asked to agree to the inclusion of their materials in the corpus without any fee and were shown the standard licence agreement, which is still in use today. Their acceptance of this arrangement may have been influenced by the originality of the concept and the prominence associated with the project.

However, there was the problem of keeping the identity of contributors hidden without diminishing the value of their work. Any explicit reference to the identity of contributors was largely removed, and the alternative of substituting a different name was discussed. This substitution approach was, however, judged not to be feasible.

Compounding this problem, contributors had originally been asked only for permission to include transcribed versions of their speech, not the speech itself. While permission could be sought from the initial contributors again, the lack of success in the anonymization process meant that it would be challenging to seek materials from them again. At the same time, two factors compounded the unwillingness of IPR owners to donate their materials. Firstly, full texts were to be excluded; secondly, there was no motivation for them to disseminate information using the corpus, particularly since the corpus operates on a non-commercial basis.

Design Issue
Within the spoken corpus, half of the component is composed of recordings of 200 informal conversations across a demographic range. The other half consists of speech recorded in a large range of predefined situations. This design was meant to account for both the demographic distribution of spoken language and linguistically significant variation due to context. However, these classifications were not well defined, which, together with the lack of readily available information, affected the accuracy and consistency with which the variables were recorded in text headers.

Overly Broad Categories
The BNC has no text categorisation for written texts beyond that of domain and no categorisation for spoken texts except by context and demographic/socio-economic classes.

For example, a wide variety of imaginative texts (novels, short stories, poems, and drama scripts) is included in the BNC, but such inclusions are of little use if researchers cannot easily retrieve the sub-genres they want to work on (e.g., poetry) because this information is omitted from the file headers and from any documentation associated with the BNC. There is at present no way to know whether an "imaginative" text actually comes from a novel, a short story, a drama script or a collection of poems (unless the title itself includes the words "a novel" or "poem").

Classification Errors and Misleading Titles
Some texts were classified under the wrong category, usually because of a misleading title. Users cannot always rely on the titles of the files as indications of their real contents. For example, many texts with "lecture" in their title are actually classroom discussions or tutorial seminars involving a very small group of people, or are popular lectures (addressed to a general audience rather than to students at an institution of higher learning).

Language Education
There are two general ways in which corpus material can be used in language teaching.

Firstly, publishers and researchers can use corpus samples to create language-learning references, syllabuses and other related tools or materials.

The BNC was used by a group of Japanese researchers as a tool in their creation of an English language-learning website for ESP learners. The website enabled English language learners to download frequently heard and used sentence patterns, and then base their own usage of the English language on these sentence patterns. The BNC served as the source from which the frequently used expressions were extracted. In using this website, users thus relied on reference samples from the BNC to guide them in their learning of the English language.

Such creation of materials that facilitate language-learning typically involves the use of very large corpora (comparable to the size of the BNC), as well as advanced software and technology. A large amount of money, time and especially expertise in the field of computational linguistics are invested in the development of such language-learning material.

Secondly, corpus analysis can be incorporated directly into the language teaching and learning environment. With this method, language learners are given the opportunity to categorize language data from the corpus and then draw conclusions about the patterns and features of their target language from their categorizations. This method involves a greater amount of work on the part of the language learner and is referred to as “data-driven learning” by Tim Johns, who was a researcher in the field of Applied Linguistics at the University of Birmingham. The corpus data used for data-driven learning is relatively small, and consequently the generalisations made about the target language may be of limited value.

In general, the BNC is useful as a reference source for the purposes of producing and perceiving text. In particular, the BNC can be used as a reference source when studying the use of individual words in various contexts, so that learners become familiar with the different ways to use particular words in suitable contexts.

Other than language-related information, encyclopedic information is also found in the BNC. Learners perusing data from the BNC are also introduced to British cultural features and stereotypes.
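One common way to study a word across contexts is a keyword-in-context (KWIC) listing, which lines up each occurrence of a word with the words on either side of it. The sketch below is a minimal illustration only. The sample text is invented rather than taken from the BNC, the function name `kwic` is hypothetical, and tokenization is simplified to splitting on whitespace.

```python
def kwic(text, keyword, width=3):
    """Return (left context, match, right context) for each hit of keyword."""
    tokens = text.split()
    hits = []
    for i, token in enumerate(tokens):
        # Compare case-insensitively, ignoring surrounding punctuation.
        if token.lower().strip(".,;:!?\"'") == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Invented sample text, not drawn from the corpus.
SAMPLE_TEXT = (
    "She will bank the cheque today. "
    "The river bank was muddy after rain."
)

for left, match, right in kwic(SAMPLE_TEXT, "bank", width=2):
    print(f"{left:>12} [{match}] {right}")
```

Reading the two lines side by side shows "bank" used once as a verb and once as a noun, which is the kind of contrast a learner might be asked to notice and categorize.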