Lingua Libre

Lingua Libre is an online collaborative project and tool by the Wikimédia France association, which aims to build a collaborative, multilingual, audiovisual speech corpus under a free license.

Description
Lingua Libre enables the recording of words, phrases or sentences of any language, oral (audio recording) or signed (video recording).



Words are presented to the speaker in the form of a list, created on the spot, in advance, or by reusing an existing Wikimedia category. The speaker simply reads the word displayed on the screen, and the software moves on to the next word when it detects a silence after the read word. This principle, borrowed from the open source software Shtooka recorder with the help of its creator, Nicolas Vion, makes it possible to record several hundreds of words per hour. The recordings are then uploaded automatically from the web client to the Wikimedia Commons media library.

In spring 2021, Lingua Libre was offline due to a fire in Strasbourg, but no audio recordings were lost.

Use of the recordings
The recordings can be consulted either on Lingua Libre or on Commons. They are mainly used on other Wikimedia projects, for example to illustrate entries on Wiktionaries or proper nouns in Wikipedia articles.

The re-use of the recordings in a language teaching context is envisaged. Language learners can freely download pronunciations and use them on GoldenDict, a popular dictionary software. Thus, audio recordings can be used as “Pronunciation Dictionaries” on GoldenDict without needing internet connection.

The recordings are also reused in Natural Language Processing projects, for example to drive Mozilla's DeepSpeech speech recognition engines.

Versions
Lingua Libre was initiated on January 23, 2015 and has had three successive versions:

Lingua Libre v.1 (2016)
As part of the Languages of France project, which aims to document and promote the regional languages of France on Wikimedia and Internet projects in general, the conception of Lingua Libre started in November 2015, partly funded by the DGLFLF (General Delegation for the French language and the languages of France). The first version of the project was launched in August 2016. Only suitable for audio recording, Lingua Libre was shown during a workshop on Occitan language in December 2016, and then presented to the online Wikimedia community and at international events in 2017.

Lingua Libre v.2 (2018)
A complete rebuilding was launched at the end of 2017. The new version of Lingua Libre is based on MediaWiki, uses Wikibase and OAuth to better integrate into the Wikimedia environment. The interface is translated via Translatewiki.net so that the project can be used by a large number of communities. The new version of the site was ready in June 2018 and opened to the public in August 2018.

Lingua Libre v.2.2 (2020)
In 2020, important changes were made to the platform; a new look was developed especially for the site, the .org domain replaced the .fr domain used until then, and added support for sign languages through video recording.

Statistics
In the first two years of the project's launch, approximately 10,000 recordings were made. The transition to v.2 was accompanied by a sharp increase in the contributions. The number of recordings multiplied by 10 in less than a year, exceeding the 100,000 threshold in May 2019. These recordings were made by 127 speakers in almost 50 languages. By September 2020, the platform had more than 300,000 recordings in 90 languages with more than 350 speakers. The 500,000 recordings milestone was reached in June 2021, thanks to 540 speakers of 120 languages.