CorCenCC



CorCenCC (Welsh: Corpws Cenedlaethol Cymraeg Cyfoes), the National Corpus of Contemporary Welsh, is a language resource for Welsh speakers, Welsh learners, Welsh language researchers, and anyone else interested in the Welsh language. CorCenCC is a freely accessible collection of language samples gathered from real-life communication and presented in the searchable online CorCenCC text corpus. The corpus is accompanied by an online teaching and learning toolkit, Y Tiwtiadur, which draws directly on the corpus data to provide resources for Welsh language learning at all ages and levels.

Launched in September 2020, CorCenCC is the first corpus of the Welsh language that includes all three aspects of contemporary Welsh: spoken, written and electronically mediated (e-language).

Composition
CorCenCC extends to 11 million words of naturally occurring Welsh language (note: the version of the corpus available on the CorCenCC website reports results in tokens rather than words). The creation of CorCenCC was a community-driven project, which offered users of Welsh an opportunity to contribute to a Welsh language resource that reflects how Welsh is currently used. The dataset therefore offers a snapshot of the Welsh language across a range of contexts of use, e.g. private conversations, group socialising, business and other work situations, education, the various published media, and public spaces. A full list of the contexts, genres and topics included is available on the project's website.

Conversations were recorded by the research team, and a crowdsourcing app enabled Welsh speakers in the community to record and upload samples of their own language use to the corpus. The published CorCenCC corpus was sampled from a range of different speakers and users of Welsh, from all regions of Wales, of all ages and genders, with a wide range of occupations, and with a variety of linguistic backgrounds (e.g. how they came to speak Welsh), to reflect the diversity of text types and of Welsh speakers found in contemporary Wales.

Tools

 * An 11-million-word Welsh language dataset
 * The CorCenCC sampling frame
 * Transcription protocols for spoken Welsh
 * CyTag: a Welsh-language part-of-speech (POS) tagger, with a bespoke tagset, designed and constructed for the project. It is used in conjunction with the semantic tagger to tag all lexical items in the corpus.
 * CySemTag: the Welsh Semantic Tagger, which applies corpus annotation automatically to Welsh language data.
 * A Welsh language pedagogic toolkit, Y Tiwtiadur, which includes:
    * a Gap Filling (Cloze) tool
    * a Word Profiler tool
    * a Word Identification tool
    * a Word Task Creator tool
 * Crowdsourcing app for data collection: designed to allow Welsh speakers to record conversations between themselves and others across a range of contexts and to upload them, complete with ethically compliant consent from participants, for inclusion in the final corpus. Crowdsourced corpus data is a relatively new direction that complements more traditional language data collection methods, and is suited to the community spirit that exists among speakers and learners of Welsh and other minoritised languages.
 * CorCenCC’s new corpus infrastructure query tools, which include the following functionalities:
    * Simple query
    * Complex query
    * Frequency list generation
    * Collocation analysis
    * N-gram analysis
    * Concordancing
    * Keyword analysis
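
Several of the query functionalities listed above are standard corpus-linguistic operations. As a rough illustration only, the following Python sketch shows what frequency lists, n-grams, concordance (keyword-in-context) lines and window-based collocate counts compute on a small hand-written token list. It is not CorCenCC's implementation: the function names and the sample sentence are hypothetical, the collocation step uses raw co-occurrence counts rather than a statistical association measure, and keyword analysis is omitted because it needs a reference corpus for comparison.

```python
from collections import Counter

# Toy, hand-written token list standing in for corpus text (not CorCenCC data).
tokens = "mae hi yn braf heddiw ac mae hi yn oer heno".split()


def frequency_list(tokens):
    """Return (token, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def concordance(tokens, node, window=2):
    """Return keyword-in-context lines as (left context, node, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


def collocates(tokens, node, window=2):
    """Count tokens occurring within `window` positions of each `node` hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


if __name__ == "__main__":
    print(frequency_list(tokens)[:3])
    print(Counter(ngrams(tokens, 2)).most_common(2))
    for left, node, right in concordance(tokens, "yn"):
        print(f"{left:>10} [{node}] {right}")
    print(collocates(tokens, "yn").most_common(3))
```

A production corpus tool would run these operations over indexed, annotated text and would normally rank collocates with an association measure rather than raw counts.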

Funding
The research on which the CorCenCC project was based was funded by the UK Economic and Social Research Council (ESRC) and Arts and Humanities Research Council (AHRC) as "Corpws Cenedlaethol Cymraeg Cyfoes (The National Corpus of Contemporary Welsh): A community driven approach to linguistic corpus construction project" (Grant Number ES/M011348/1).