Talk:Bzip2

Untitled
There was also briefly some content at bunzip2: bzip2 and bunzip2 are free open-source compression utilities.


 * They're a single utility with different behavior depending on the name used to call it, actually. :)

''Many consider them "third-generation" compression utilities, surpassing both first-generation tools (like arc and LHA) and second-generation tools (such as the popular PKZIP and gzip formats) in compression ability; it "pays" for this extra compression with an increased computational cost. Nonetheless, with the constant effect of Moore's Law making computer time less and less important, compression methods like bzip2 have become more popular.''

Of particular note is the fact that, unlike PKZIP, bzip2 is released under a very permissive license, which encourages its use in both open- and closed-source software.


 * Note that in addition to bzip2, gzip and zlib, there's the PKZIP-compatible Info-ZIP, also under a permissive license. --Brion 04:41 23 Jun 2003 (UTC)


 * Indeed. I was referring specifically to the PKZIP product, which doesn't have a permissive license.  With the move to bzip2, however, and the new article, I don't see much need for my third-generation rambles and jabber about PKZIP's license.  The addition of the Moore's Law comment is good enough for me; I was just trying to destubify an article I knew a bit about (bunzip2), but there's no need for that here. Phil Bordelon 04:47 23 Jun 2003 (UTC)

π?
''bit-sequences derived from the decimal representation of pi.''
 * Eh? Why? -- Anon.


 * Indeed, this caught my eye too and is nonsense. -S

/*--  A 6-byte block header, the value chosen arbitrarily as
      0x314159265359 :-).  A 32 bit value does not really give a
      strong enough guarantee that the value will not appear by
      chance in the compressed datastream.  Worst-case probability
      of this event, for a 900k block, is about 2.0e-3 for 32 bits,
      1.0e-5 for 40 bits and 4.0e-8 for 48 bits.  For a compressed
      file of size 100Gb -- about 100000 blocks -- only a 48-bit
      marker will do.  NB: normal compression/decompression do
      *not* rely on these statistical properties.  They are only
      important when trying to recover blocks from damaged
      files. --*/


 * ...from . (And look at the fifth thru tenth bytes of a .bz2 file with a hex editor.) Frencheigh 17:51, 24 August 2005 (UTC)


 * I've also added that the other magic number used (to mark the end of stream) is sqrt(π) and that both are in binary-coded decimal format. Sladen 20:21, 10 January 2007 (UTC)
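The 48-bit constant is easy to verify directly. A minimal sketch (Python, using only the standard library's bz2 module; the input string is arbitrary): compress some data and check that the first block header, immediately after the 4-byte "BZh9" stream header, is pi's first twelve decimal digits packed as binary-coded decimal.

```python
import bz2

# Compress an arbitrary test string; any input will do.
data = bz2.compress(b"hello, bzip2 magic numbers")

# Stream header: "BZh" plus the block-size digit '1'-'9'.
assert data[:3] == b"BZh"

# The 6-byte block header that follows is 0x314159265359,
# i.e. the digits of pi in binary-coded decimal.
block_magic = data[4:10]
assert block_magic == bytes.fromhex("314159265359")
```

(The end-of-stream marker, sqrt(pi) as 0x177245385090, is generally not byte-aligned, so it cannot be checked by a simple byte comparison.)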

GN00
"In GNU, bzip2 can be used combined or independently of tar"

In GNU, what the heck? No-one says "in GNU" and it should be "In Unix", anyway. Opinions? Jsalomaa 21:55, 29 August 2005 (UTC)
 * Very well, I fixed it - you really can use it under practically any Unix and not just under GNU. Jsalomaa 19:56, 29 September 2005 (UTC)


 * Also older versions of GNU tar used the -I switch instead of the -j used currently. This used to be very confusing to me when the older version was still in use.  – b_jonas 15:03, 31 January 2006 (UTC)


 * Except under some older UNIXes where bzip2'ing options (-I or -j) have not made it into tar! Jen 23:12, 15 May 2006 (UTC)
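When tar itself lacks a bzip2 switch, the two tools can still be combined through a pipe, or from a scripting language. A small sketch in Python (standard-library tarfile; the file names are made up for the example): the "w:bz2" mode has tarfile run the archive through bz2 itself, with no dependence on tar's -j/-I switches.

```python
import pathlib
import tarfile
import tempfile

# Build and re-read a .tar.bz2 archive entirely in Python.
with tempfile.TemporaryDirectory() as d:
    src = pathlib.Path(d) / "hello.txt"
    src.write_text("bzip2 and tar are separate tools\n")

    archive = pathlib.Path(d) / "hello.tar.bz2"
    with tarfile.open(archive, "w:bz2") as tf:      # compress while archiving
        tf.add(src, arcname="hello.txt")

    with tarfile.open(archive, "r:bz2") as tf:      # decompress while reading
        member = tf.extractfile("hello.txt").read()

assert member == b"bzip2 and tar are separate tools\n"
```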

Origin of the 'b'?
Just curious -- does anyone know why bzip is called bzip? --babbage 20:40, 5 March 2006 (UTC)
 * -- the man page says: "bzip2, bunzip2 - a block-sorting file compressor"; so the 'b' comes from 'block'


 * -- Or, just think of the "better ZIP" :-) -- since actually most compressors split their data into blocks before crushing each. Jen 23:10, 15 May 2006 (UTC)


 * -- Another possible explanation: it's one of the few algorithms that use the Burrows–Wheeler transform... The 'b' might come from that.

tar?
Why is there information about tar that doesn't relate to bzip? Goffrie 20:37, 3 June 2006 (UTC)

Redundant "Run-length encoding" sections
There are two "Run-length encoding" sections in the article. They need to be merged.--Father Goose 03:03, 23 April 2007 (UTC)
 * It appears to be clear from the textual description of the two "RLE" stages that they both function in non-obvious ways. The first stage encodes the length only after four consecutive symbols.  The later stage encodes the length in binary using the additional symbols RUNA and RUNB.  Based on the differing ways in which the two operate (even if the heading is the same), I believe it would be inappropriate to merge them.  Note also that the two uses of this technique are several stages apart; to merge them would produce an inaccurate reflection of the compression stack. Sladen 09:56, 23 April 2007 (UTC)
 * Ah, okay, I failed to notice that section was describing the algorithm used, and that RLE is performed twice in the sequence, in two different forms. My error.--Father Goose 10:44, 23 April 2007 (UTC)

Technical limitations of bzip2
What are the technical limitations of bzip2? What is the maximum file size, the longest possible contained filename, the maximum number of contained files, etc.? --Loh 12:54, 24 April 2007 (UTC)


 * As far as I know there is no maximum file size (it is only limited by the containing filesystem, or perhaps [num of 900k blocks]*[maximum size of architecture integer]). Filenames and 'contained files' are not relevant because bzip2 is only a data compressor, not an archiver. It has no concept of combining together multiple files, typically tar is used for that. --Bk0 (Talk) 00:32, 25 April 2007 (UTC)
 * The Bzip2 format itself does not have support for storing a filename, or even a file timestamp. As noted by Bk0, data is seen just as one long stream, with no knowledge of the contents. The tar (file format) does have limitations, but those are not inherited from Bzip and Bzip2 can be used with an alternative archiving format such as an uncompressed ZIP file. The Bzip2 file format does not contain any limitations (IIRC), though the supplied bzip2recover utility does. Sladen 10:33, 25 April 2007 (UTC)
 * Well, bzip2recover is designed to search for the two markers (pi and sqrt(pi), 48-bit binary-coded decimal) and to extract usable blocks from that. Incidentally, it should be noted that these markers do *not* have to be byte-aligned, and as such may not be easy to recognize in a file (manually using a hex editor or by something that searches for bytes). The first block marker is clearly recognizable, as it's right after BZh9 and looks like "1AY&SY" in ASCII, but the end-of-stream marker, which looks like ".re8P." (dots are non-printable ASCII), may not be recognizable in a text or hex editor if it isn't on a byte boundary, though bzip2recover will still see it. Also, subsequent block markers may be unrecognizable. nneonneo 04:10, 15 July 2007 (UTC)
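The bit-alignment point can be illustrated with a brute-force scan (a pure-Python sketch, not how the real bzip2recover, written in C, does it): test every bit offset for the 48-bit block magic.

```python
import bz2

def find_block_magic(data, magic=0x314159265359, width=48):
    """Return the bit offsets at which the 48-bit block magic occurs.

    Blocks need not start on byte boundaries, so every bit offset is
    tested (brute force; fine for small inputs)."""
    big = int.from_bytes(data, "big")
    total_bits = len(data) * 8
    mask = (1 << width) - 1
    hits = []
    for bit in range(total_bits - width + 1):
        if (big >> (total_bits - width - bit)) & mask == magic:
            hits.append(bit)
    return hits

stream = bz2.compress(b"x" * 100)
# The first block magic always sits right after the 4-byte "BZh9"
# stream header, i.e. at bit offset 32.
assert 32 in find_block_magic(stream)
```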

Apache link removed
I removed the external link to the Apache BZip2 implementation because it doesn't seem to be standard. It expects the data to start at byte 3, without the "BZ" magic. --Zom-B (talk) 18:54, 11 June 2009 (UTC)

"Compression efficiency"
I cleaned up the "compression efficiency" section because the old version made inaccurate claims (a specific speed comparison against X) and the benchmark reference is not a very good one. Furthermore, the technical description of the magic number is irrelevant to compression, and the magic number is already described in the technical section about the format, which is where it belongs. Samir000 (talk) 20:09, 14 February 2010 (UTC)

Second RLE and Huffman descriptions
Since the output of the second RLE is used by Huffman, it seems to me that it must already remove the zero symbol (which is instead replaced by RUNA/RUNB). Thus the text would be as follows:
 * Run-length encoding (RLE): long strings of repeated symbols in the output (normally zeros by this time) are replaced by a combination of the symbol and a sequence of two special codes, RUNA and RUNB, which represent the run-length as a binary number greater than zero (0).  Because the MTF encoding transforms a run into a single non-zero element followed by a streak of zeros, RLE is only used to compress runs of zeros.  Apart from this limitation, this RLE process is more flexible than the RLE of step 1, as it is able to encode arbitrarily long integers (in practice, this is usually limited by the block size, so that this step does not encode a run of more than 900000 bytes). The run-length is encoded in this fashion: assigning place values of 1 to the first bit, 2 to the second, 4 to the third, etc. in the RUNA/RUNB sequence, multiply each place value in a RUNB spot by 2, and add all the resulting place values (for RUNA and RUNB values alike) together. Thus, the sequence RUNB, RUNA results in the value (1*2 + 2) = 4.
 * Huffman coding: this process replaces fixed-length symbols (8-bit bytes) with variable-length codes based on the frequency of use. More frequently used codes end up shorter (2-3 bits) whilst rare codes can be allocated up to 20 bits.  The codes are selected carefully so that no sequence of bits can be confused for a different code. The end-of-stream code is particularly interesting. If there are n different bytes (symbols) used in the uncompressed data, then the Huffman code will consist of two RLE codes (RUNA and RUNB), n-1 symbol codes and one end-of-stream code. As in the previous step, there is never any need to explicitly reference the first symbol in the MTF table (it is replaced instead by RUNA and RUNB), thus explaining why only n-1 symbols are coded in the Huffman tree. In the extreme case where only the zero symbol is used in the uncompressed data, there will be no symbol codes at all in the Huffman tree, and the entire block will consist of RUNA and RUNB (implicitly repeating the single byte) and an end-of-stream marker with value 2.

Another change is in the final line: since the 0 value is not represented, the byte values there are 1-255, and the end of stream is represented by a value between 2 and 257 (not 258). Can anyone confirm and/or make the changes? Balabiot (talk) 07:45, 1 March 2010 (UTC)
 * I have tweaked it slightly to remove the "8-bit bytes" mention, as at this stage RUNA/RUNB will have been included into the stream of symbols (range 0-257), with the end of stream marker that is 0-258. IIRC, the symbol block for is only nullified in the case of there only being a single normal symbol used.  —Sladen (talk) 09:48, 1 March 2010 (UTC)
 * bzip2-1.0.6 compress.c#228 suggests that the end of stream is 257 (mtfv[wr] = EOB; and EOB = s->nInUse+1;). — Preceding unsigned comment added by 2001:268:C143:8531:E42A:B9E7:E6E5:7B19 (talk) 13:52, 27 April 2018 (UTC)
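The RUNA/RUNB scheme discussed in this thread is bijective base 2 (RUNA is digit 1, RUNB is digit 2, least-significant digit first), which is why every run length of 1 or more has exactly one encoding. A sketch with hypothetical helper names (not functions from the bzip2 sources):

```python
def decode_run(symbols):
    """Sum digit * 2**i, where RUNA is digit 1 and RUNB is digit 2."""
    return sum((1 if s == "RUNA" else 2) << i for i, s in enumerate(symbols))

def encode_run(n):
    """Inverse of decode_run for n >= 1 (bijective base 2)."""
    symbols = []
    while n > 0:
        n -= 1  # shift to digits {1, 2} instead of {0, 1}
        symbols.append("RUNA" if n % 2 == 0 else "RUNB")
        n //= 2
    return symbols

# The example above: RUNB, RUNA -> 1*2 + 2 = 4.
assert decode_run(["RUNB", "RUNA"]) == 4
assert encode_run(4) == ["RUNB", "RUNA"]
```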

LZMA
Article says "LZMA is generally more space-efficient than bzip2 at the expense of slower compression speed, while having much faster decompression." This is based on these benchmarks: http://compressionratings.com/comp.cgi?7-zip+9.12b++bzip2+1.0.5++gzip+1.3.3+-5 But in those benchmarks only the highest compression level (-9) of LZMA was tested. These benchmarks (http://tukaani.org/lzma/benchmarks.html) show that LZMA with different settings (-1 or -2) can be comparable to or more efficient than bzip2 in both compression speed and space-efficiency. 213.155.215.214 (talk) 19:32, 26 February 2012 (UTC)
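This is straightforward to try with the Python standard library's bz2 and lzma modules (one synthetic repetitive input here, so the resulting sizes say nothing general; the linked benchmarks use large mixed corpora and also measure time):

```python
import bz2
import lzma

data = b"the quick brown fox jumps over the lazy dog. " * 2000

bz2_size  = len(bz2.compress(data, 9))           # bzip2 at its maximum level
lzma_fast = len(lzma.compress(data, preset=1))   # LZMA at a fast preset
lzma_best = len(lzma.compress(data, preset=9))   # LZMA at its maximum preset

# All three compress this repetitive input heavily; the relative
# ranking depends entirely on the data and the settings chosen.
assert max(bz2_size, lzma_fast, lzma_best) < len(data)
```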

Blocksize
The file format section states for the "contents" part that it is "max. 7372800 bit". This would translate to 900 * 1024 * 8. However, bzlib.c calculates the size of the buffer to be 1000 * blockSize100k. It therefore looks like this might be an error in the Wiki article, i.e., kilo and kilo binary were confused for each other. Therefore, I think it should read "max. 7200000 bit". However, I'm not 100% sure. If anyone could check this, please? Maxiantor (talk) 15:19, 19 November 2019 (UTC)
 * I can help with validation as well as editing the page. "the bzlib.c calculates", can you be more specific about which file (version) and which line? BernardoSulzbach (talk) 16:48, 23 November 2019 (UTC)
 * The contents part of the file format is not the BWT buffer size, so I doubt either of these maximums is correct. The BWT step limits the block to 900,000 symbols. But in the worst pathological case none of the symbols gets run-length encoded and the huffman coding could use 20 bits for each of them. Shouldn't the maximum be 900,000 * 20 = 18,000,000 bits? --EoZahX9m (talk) 05:32, 3 April 2020 (UTC)
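For reference, the arithmetic behind the three candidate figures in this thread (just the numbers being discussed, not a verified bound on the format):

```python
bwt_block  = 9 * 100000      # level 9: blockSize100k = 9 -> 900,000 symbols
kibi_bits  = 900 * 1024 * 8  # 7,372,800 bits -- the article's current figure
kilo_bits  = 900000 * 8      # 7,200,000 bits -- Maxiantor's proposed figure
worst_case = 900000 * 20     # 18,000,000 bits -- 20-bit codes for every symbol

assert (kibi_bits, kilo_bits, worst_case) == (7372800, 7200000, 18000000)
```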