BLOOM (language model)

BigScience Large Open-science Open-access Multilingual Language Model (BLOOM) is a 176-billion-parameter transformer-based autoregressive large language model (LLM). The model, as well as the code base and the data used to train it, are distributed under free licences. BLOOM was trained on approximately 366 billion (1.6TB) tokens from March to July 2022.

BLOOM is the main outcome of the BigScience collaborative initiative, a one-year-long research workshop that took place between May 2021 and May 2022. BigScience was led by HuggingFace and involved several hundreds of researchers and engineers from France and abroad representing both the academia and the private sector. BigScience was supported by a large-scale public compute grant on the French public supercomputer Jean Zay, managed by GENCI and IDRIS (CNRS), on which it was trained.

BLOOM's training corpus, named ROOTS, combines data extracted from the then-latest version of the web-based OSCAR corpus (38% of ROOTS) and newly collected data extracted from a manually selected and documented list of language data sources. It encompasses 46 natural languages (in amounts ranging from 30% of the whole dataset for English to 0.00002% for Chi Tumbuka) and 13 programming languages.