Liblzg

liblzg is a compression library for performing lossless data compression. It implements an algorithm that is a variation of the LZ77 algorithm, called the LZG algorithm, with the primary focus of providing a very simple and fast decoding method. One of the key features of the algorithm is that it requires no memory during decompression. The software library is free software, distributed under the zlib license.

Algorithm
If a duplicate series of bytes (a repeated string) is spotted in the uncompressed data stream, then a back-reference is inserted, linking to the previous location of that identical string instead. An encoded match to an earlier string consists of a length (3–128 bytes) and a distance (1–526,341 bytes). The level of compression can be controlled by specifying the maximum distance for which duplicated strings will be searched (this is the size of the sliding window).

Data format
The data format consists of a header, followed by the compressed data. The header contains an identifier and house keeping information, such as compressed and decompressed data sizes and a 32-bit checksum (a variant of the Fletcher checksum).

The compressed data starts with four bytes, identifying four unique 8-bit marker symbols (m1, m2, m3 and m4). These are used to separate literal data bytes from various forms of length-distance pair encodings.

Any symbol that is not a marker byte is considered a literal byte, and will be copied as is to the decompressed data buffer. However, if the decoder encounters any of the four marker bytes, it will decode a length-distance pair that is used as a back reference into the previously decompressed data.

The marker bytes are interpreted as follows (% denotes a binary number):

General copy (m1)
m1 represents the most general form of a copy operation, and it occupies four bytes in the compressed data stream: ...where length=DECODELENGTH(%lllll+2), and offset=%ooommmmmmmmnnnnnnnn + 2056.

Medium copy (m2)
m2 is a shorter form of a copy operation, occupying three bytes in the compressed data stream: ...where length=DECODELENGTH(%lllll+2), and offset=%ooommmmmmmm + 8.

Short copy (m3)
m3 requires only two bytes, and is used for short lengths, close to the marker: ...where length=%ll+3, and offset=%oooooo + 8.

Near copy (m4)
m4 requires only two bytes, and is used for nearby copies (including RLE, when the offset is 1): ...where length=DECODELENGTH(%lllll+2), and offset=%ooo + 1.

Literal copy
As a special case, if any of the marker symbols are followed by a zero byte (0), the marker symbol itself is written to the decompressed buffer.

Non-linear length encoding
The DECODELENGTH function implements a non-linear mapping of a number in the range 3-33 to a number in the range 3-128, according to the following table:

Worst case data growth
As the marker symbols are chosen as the four least common symbols in the uncompressed data stream (with a probability of at most $$\tfrac{1}{256}$$ each), and a single occurrence of a marker symbol requires two bytes to encode, the compressed data may grow by at most $$\tfrac{4}{256}$$ < 1.6% compared to the decompressed data (worst case).

The liblzg library compensates for this by using a plain 1:1 copy mode if the encoder identifies that the compressed data will be larger than the original uncompressed data. Hence, in practice, the maximum data growth is 0% (plus the size of the data header, which is 16 bytes).

Implementations
Both the compression and the decompression algorithms are implemented in an open source library, written in the C programming language. There are also several alternate implementations of the decompression algorithm available (for instance in JavaScript and 8-bit assembly language).