Llama.cpp

llama.cpp is an open-source software library written in C++ that performs inference on various large language models, such as Llama. It is co-developed alongside the ggml library, a general-purpose tensor library.

History
llama.cpp began development in March 2023 by Georgi Gerganov as an implementation of the Llama inference code in pure C++ with no dependencies. This improved performance on computers without a GPU or other dedicated hardware. As of July 2024 it had over 61,000 stars on GitHub. Before llama.cpp, Gerganov worked on a similar library called whisper.cpp, which implemented Whisper, a speech-to-text model by OpenAI. llama.cpp gained traction with users who lacked specialized hardware, as it could run on just a CPU, including on Android devices.

Architecture
llama.cpp initially could only run on CPUs but can now run on GPUs using multiple back-ends, including Vulkan and SYCL. These back-ends make up the GGML tensor library, which is used by the front-end, model-specific llama.cpp code. llama.cpp supports ahead-of-time model quantization as opposed to on-the-fly quantization.

GGUF file format
The GGUF file format is a binary format used by llama.cpp that stores both tensors and metadata in a single file. GGUF files are typically created by converting models developed in another file format from a different machine learning library, such as PyTorch. GGUF is intended to make model files easy and fast to load within llama.cpp and other ggml projects.
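The fixed-size header at the start of a GGUF file illustrates how the format packages tensors and metadata: a 4-byte magic string "GGUF", a version number, a tensor count, and a metadata key/value count, all little-endian. The sketch below parses such a header from a synthetic byte string; the exact field widths shown (uint32 version, uint64 counts) match GGUF version 2 and later, and the example values are invented for illustration.

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF file header (little-endian).

    Assumed layout (GGUF v2+): 4-byte magic b"GGUF", uint32 version,
    uint64 tensor count, uint64 metadata key/value count.
    """
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {
        "version": version,
        "tensor_count": n_tensors,
        "metadata_kv_count": n_kv,
    }

# Build a minimal synthetic header (invented values) to demonstrate parsing.
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 19)
print(read_gguf_header(header))
```

The metadata key/value section that follows the header is what carries the architecture information described below; a real reader would continue parsing from byte offset 24.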

GGUF was created to replace the project's previous file formats, which did not include architecture metadata and therefore made it difficult to extend the software without breaking backwards compatibility.

The format focuses on supporting different quantization types, which can reduce memory usage and increase speed at the expense of lower model precision.
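The memory/precision trade-off can be seen in even the simplest quantization scheme. The sketch below is a minimal symmetric 8-bit round-trip, not one of llama.cpp's actual quantization formats (which use block-wise scales and finer bit widths); it shows how storing one float scale plus small integers shrinks the weights while bounding the error per weight.

```python
def quantize_q8(weights):
    """Symmetric 8-bit quantization: map floats to int8 plus one float scale.

    Illustrative only; real GGUF quantization types are block-wise and
    more elaborate.
    """
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_q8(q, scale):
    """Recover approximate float weights from the quantized form."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 1.27, -1.0]
q, scale = quantize_q8(weights)
restored = dequantize_q8(q, scale)

# Each restored weight is within half a quantization step of the original.
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(weights, restored))
print(q, scale)
```

Storing the weights as one byte each (plus the shared scale) instead of four bytes of float32 is where the memory saving comes from; the bounded rounding error is the precision cost.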

Supported data types
GGUF supports common floating-point data formats such as float32, float16, and bfloat16, as well as quantized integer types ranging from 1.5-bit to 8-bit.
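The practical effect of these data types is easy to estimate with back-of-the-envelope arithmetic. The sketch below computes approximate weight-storage sizes for a hypothetical 7-billion-parameter model at several bit widths; it deliberately ignores the small per-block scale and metadata overhead that real quantization formats add.

```python
def model_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB, ignoring the per-block
    scale/metadata overhead of real quantization formats."""
    return n_params * bits_per_weight / 8 / 2**30

n = 7e9  # a hypothetical 7B-parameter model
for name, bits in [("float32", 32), ("float16", 16),
                   ("8-bit", 8), ("4-bit", 4), ("2-bit", 2)]:
    print(f"{name:>8}: {model_size_gib(n, bits):5.1f} GiB")
```

This is why a model that does not fit in RAM at float32 (roughly 26 GiB here) can run on commodity hardware once quantized to 4 bits (roughly 3.3 GiB).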

Supported models

 * LLaMA
 * Llama 2
 * Llama 3
 * Mistral 7B
 * Mixtral 8x7B
 * Mixtral 8x22B
 * DBRX
 * GPT-2
 * BLOOM
 * Gemma
 * Grok-1
 * Mamba