T5 (language model)

T5 (Text-to-Text Transfer Transformer) is a series of large language models developed by Google AI. Introduced in 2019, T5 models are trained on a massive dataset of text and code using a text-to-text framework. The T5 models can perform the text-based tasks that they were pretrained for, and they can also be finetuned to perform other tasks. They have been employed in various applications, including chatbots, machine translation systems, text summarization tools, code generation, and robotics.

Like the original Transformer model, T5 models are encoder-decoder Transformers, where the encoder processes the input text, and the decoder generates the output text.

In 2022, the original T5 implementation was succeeded by T5X, which is built on JAX. In 2024, Pile-T5 followed, training the same architecture on an improved dataset (The Pile).

Training
T5 models are pre-trained on the Colossal Clean Crawled Corpus (C4), containing text and code scraped from the internet. This pre-training process enables the models to learn general language understanding and generation abilities. T5 models can then be fine-tuned on specific downstream tasks, adapting their knowledge to perform well in various applications.

The T5 models were pretrained on many tasks, all in the format of <input text> -> <output text>.

Some examples are:


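 * restoring corrupted text: "Thank you <X> me to your party <Y> week." -> "<X> for inviting <Y> last <Z>", where the <Z> means "end of output".
 * translation: "translate English to German: That is good." -> "Das ist gut.".
 * judging the grammatical acceptability of a sentence (CoLA sentence): "cola sentence: The course is jumping well." -> "not acceptable".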
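
The text-to-text format described above can be sketched in a few lines of Python. This is a minimal illustration, not the preprocessing code used to train T5: the function names (`corrupt_spans`, `with_prefix`, `sentinel`) are hypothetical, the spans to mask are passed in by hand rather than sampled at random, and the sentinel spellings `<X>`, `<Y>`, `<Z>` are shorthand placeholders rather than actual vocabulary tokens.

```python
from typing import List, Tuple


def sentinel(i: int) -> str:
    """Hypothetical sentinel naming: <X>, <Y>, <Z>, then <S3>, <S4>, ..."""
    names = ["<X>", "<Y>", "<Z>"]
    return names[i] if i < len(names) else f"<S{i}>"


def corrupt_spans(words: List[str], spans: List[Tuple[int, int]]) -> Tuple[str, str]:
    """Replace each [start, end) word span with a sentinel.

    Returns (input_text, target_text). The target lists each sentinel
    followed by the words it replaced, then one final sentinel that
    marks the end of the output.
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        if start < cursor or end <= start or end > len(words):
            raise ValueError("spans must be non-empty, in range, and non-overlapping")
        inputs.extend(words[cursor:start])   # keep the uncorrupted words
        inputs.append(sentinel(i))           # mask the span in the input
        targets.append(sentinel(i))          # announce the span in the target
        targets.extend(words[start:end])     # then spell out the masked words
        cursor = end
    inputs.extend(words[cursor:])
    targets.append(sentinel(len(spans)))     # final sentinel: end of output
    return " ".join(inputs), " ".join(targets)


def with_prefix(task_prefix: str, text: str) -> str:
    """Prepend a task description so that one model can serve many tasks."""
    return f"{task_prefix}: {text}"


words = "Thank you for inviting me to your party last week.".split()
model_input, model_target = corrupt_spans(words, [(2, 4), (8, 9)])
translation_input = with_prefix("translate English to German", "That is good.")
```

Because every task is reduced to a pair of strings, the same model, loss, and decoding procedure can be reused across corruption, translation, and classification tasks; only the prefix and the target text change.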

Architecture
The T5 series comprises several models of different sizes. They are usually distinguished by their parameter count, which gives a rough indication of a model's capacity. The original paper reported the following five models:

In the above table,


 * # layers: Number of layers in the encoder, which is also the number of layers in the decoder; the two stacks always have the same depth.
 * # heads: Number of attention heads in each attention block.
 * $$d_{model}$$: Dimension of the embedding vectors.
 * $$d_{ff}$$: Dimension of the hidden layer of the feedforward network within each encoder and decoder layer.
 * $$d_{kv}$$: Dimension of the key and value vectors in each attention head.
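
To show how these hyperparameters relate to model size, the following sketch estimates the number of weights in an encoder-decoder Transformer from the quantities listed above. It rests on several simplifying assumptions that are not stated in the text: attention and feedforward layers have no bias terms, the feedforward network consists of two weight matrices, one embedding matrix is shared between input and output, and small contributions such as layer normalization and position biases are ignored. The class and function names are hypothetical, and the example values (12 layers, 12 heads, $$d_{model}$$ = 768, $$d_{ff}$$ = 3072, $$d_{kv}$$ = 64, a 32,128-token vocabulary) are assumed for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class T5Config:
    """Hypothetical container for the hyperparameters listed above."""
    n_layers: int    # layers in the encoder (and, equally, in the decoder)
    n_heads: int     # attention heads per attention block
    d_model: int     # embedding dimension
    d_ff: int        # hidden dimension of the feedforward network
    d_kv: int        # key/value dimension per attention head
    vocab_size: int  # assumed vocabulary size (not part of the list above)


def attention_params(cfg: T5Config) -> int:
    """Weights in one attention block: query, key, value, and output projections."""
    inner = cfg.n_heads * cfg.d_kv  # need not equal d_model
    return 4 * cfg.d_model * inner


def feedforward_params(cfg: T5Config) -> int:
    """Weights in one feedforward block: d_model -> d_ff -> d_model."""
    return 2 * cfg.d_model * cfg.d_ff


def approx_weight_count(cfg: T5Config) -> int:
    """Rough total, ignoring layer norms and position biases (an assumption)."""
    encoder_layer = attention_params(cfg) + feedforward_params(cfg)
    # Each decoder layer has self-attention plus cross-attention.
    decoder_layer = 2 * attention_params(cfg) + feedforward_params(cfg)
    embedding = cfg.vocab_size * cfg.d_model  # assumed shared input/output table
    return embedding + cfg.n_layers * (encoder_layer + decoder_layer)


# Illustrative values only; see the assumptions in the lead-in.
example = T5Config(n_layers=12, n_heads=12, d_model=768, d_ff=3072,
                   d_kv=64, vocab_size=32128)
print(f"{approx_weight_count(example):,} weights (approximate)")
```

Under these assumptions the example comes out to roughly 223 million weights. Note that the attention projections depend on the product of # heads and $$d_{kv}$$ rather than on $$d_{model}$$ alone, so the two can be scaled independently.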