Change to bzip2?

Jindrich Novy jnovy at redhat.com
Wed Feb 2 16:00:04 UTC 2005


On Wed, 2005-02-02 at 15:45 +0100, Florian La Roche wrote:
> On Wed, Feb 02, 2005 at 06:03:24AM -0800, Steve G wrote:
> > Hi,
> > 
> > With the discussion about trimming specfile changelogs to save space and improve
> > downloads...why not go one step further? Mandrake has been using bzip2 for a
> > while and it works just as well and files are significantly smaller. The
> > conversion could be done in several steps:
> > 
> > 1) man pages - less already handles bzipped man pages
> > 2) info pages - I submitted a patch in bz #128637 to try to get it working
> > 3) tar
> > 4) rpms - I'm sure the patch is in Mandrake's version
> > 
> > Thoughts?
> 
> bzip2 is only used for the cpio-packed file-data, the rpm-header is
> not compressed. For the repo-data the changelog can also be trimmed,
> only if you need to copy the rpm header unmodified this is actually
> getting a problem (e.g. if you later-on want to verify the md5sum to
> be the same as in full rpms you download or similar things).
> 
> I think staying with gzip is ok as it really is a good middle ground
> between speed and disk compression ratio. bzip2 "feels" noticeably slower.

In my opinion a conversion to bzip2 is the right thing to do. I also
try to keep almost everything compressed with bzip2 because of its
significantly better compression ratio. I'll illustrate this on the mc
tarball:

-rw-rw-r--  1 jnovy jnovy 2831562 Jan 28 09:52 mc-4.6.1-pre3.tar.bz2
-rw-rw-r--  1 jnovy jnovy 3956127 Feb  2 15:26 mc-4.6.1-pre3.tar.gz

where we can see that the gzipped tarball is more than one third (about
40%) larger than the bzipped one. Decompression times are:

gunzip decompression:
real    0m0.257s
user    0m0.198s
sys     0m0.059s

bunzip2 decompression:
real    0m1.665s
user    0m1.567s
sys     0m0.098s

so one could conclude that bunzip2 is about 6-7 times slower than
gunzip. Unfortunately this is a common myth among developers, because
the comparison is not like for like: bzip2 uses its best compression
(-9, so 900k blocks for BWT) by default, while gzip defaults to a
compromise level (-6). The levels also mean something different in the
two tools, since gzip is LZ77 based.

bzip2 is flexible enough to trade compression ratio for speed. With the
fastest (and worst) -1 compression, bzip2 gives:

-rw-rw-r--  1 jnovy jnovy 3592894 Feb  2 16:08 mc-4.6.1-pre3.tar.bz2

which is still smaller than the best compression (-9) with gzip, and the
decompression time is:

real    0m1.076s
user    0m1.003s
sys     0m0.073s

so about 4 times slower than gzip.
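
For anyone who wants to repeat this kind of comparison on their own
files, here is a rough sketch in Python using only the standard gzip and
bz2 modules. It is not how the numbers above were produced (those came
from the command-line tools); the function names and the sample input
are made up for illustration, and in-process timings will differ from
"time gunzip" / "time bunzip2".

```python
import bz2
import gzip
import time


def measure(name, compress, decompress, data, level):
    """Return (name, level, compressed size, decompression seconds)."""
    packed = compress(data, level)
    start = time.perf_counter()
    restored = decompress(packed)
    elapsed = time.perf_counter() - start
    assert restored == data  # the round trip must be lossless
    return name, level, len(packed), elapsed


def compare(data, levels=(1, 6, 9)):
    """Run gzip and bzip2 at each level and collect the measurements."""
    rows = []
    for level in levels:
        rows.append(measure("gzip", gzip.compress, gzip.decompress, data, level))
        rows.append(measure("bzip2", bz2.compress, bz2.decompress, data, level))
    return rows


if __name__ == "__main__":
    # Stand-in input (an assumption); replace with the bytes of a real tarball.
    sample = b"".join(b"line %d of sample text\n" % i for i in range(20000))
    for name, level, size, seconds in compare(sample):
        print(f"{name:5s} -{level}: {size:9d} bytes, {seconds:.4f}s to decompress")
```

Results depend heavily on the input, so it is worth feeding it the same
tarball you actually intend to ship.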

The question is what the priority is at the moment: the space consumed
by the file or the decompression time.

There are also projects such as pbzip2 (http://compression.ca/pbzip2/)
that exploit the fact that bzip2 compresses large files in separate
blocks, so the BWT and Huffman encoding phases can be run on several
blocks simultaneously in multiple threads, which speeds up compression
and decompression significantly on SMP machines.
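
To illustrate the idea only (this is not pbzip2's code), here is a
minimal Python sketch that splits the input into chunks, compresses each
chunk as an independent bzip2 stream in a thread pool, and concatenates
the results. The function name, chunk size and worker count are
assumptions for the example. It relies on Python's bz2.decompress
accepting several concatenated streams, and on the bz2 module releasing
the interpreter lock while compressing so the threads can run in
parallel.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def parallel_bz2_compress(data, chunk_size=900_000, level=9, workers=4):
    """Compress chunks independently and concatenate the bzip2 streams."""
    # Split the input into fixed-size chunks; keep one empty chunk for
    # empty input so the output is still a valid bzip2 stream.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if not chunks:
        chunks = [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the streams stay in sequence.
        streams = list(pool.map(lambda chunk: bz2.compress(chunk, level), chunks))
    return b"".join(streams)


if __name__ == "__main__":
    payload = bytes(range(256)) * 20000
    packed = parallel_bz2_compress(payload)
    # A multi-stream file decompresses back to the original bytes.
    assert bz2.decompress(packed) == payload
    print(len(payload), "->", len(packed), "bytes")
```

Because every chunk is a self-contained stream, the result is slightly
larger than a single-stream file, which is the price of the parallelism.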

Further, consider the range of sizes bzip2 covers across its compression
levels:

best (-9): 2831562, worst (-1): 3592894
and gzip:
best (-9): 3931362, worst (-1): 4634277

I think bzip2 is the winner, at least with an eye to the future.
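
The ratios quoted in this message can be checked from the raw figures
with a few lines of Python. The sizes and times below are copied from
the listings above; the variable names are just for illustration, and I
am assuming the 3956127-byte tarball was made with gzip's default level.

```python
# Sizes in bytes, copied from the listings in this message.
BZIP2_BEST = 2831562      # bzip2 -9 (the default)
BZIP2_FAST = 3592894      # bzip2 -1
GZIP_BEST = 3931362       # gzip -9
GZIP_DEFAULT = 3956127    # gzip default level (assumed to be -6)
GZIP_FAST = 4634277       # gzip -1

# Wall-clock ("real") decompression times in seconds.
GUNZIP_TIME = 0.257
BUNZIP2_TIME_BEST = 1.665
BUNZIP2_TIME_FAST = 1.076

# How much larger the default gzip tarball is than the default bzip2 one.
size_ratio = GZIP_DEFAULT / BZIP2_BEST

# How much slower bunzip2 is than gunzip at -9 and at -1.
slowdown_best = BUNZIP2_TIME_BEST / GUNZIP_TIME
slowdown_fast = BUNZIP2_TIME_FAST / GUNZIP_TIME

# Does even the fastest bzip2 setting beat gzip's best compression?
fast_bzip2_beats_best_gzip = BZIP2_FAST < GZIP_BEST

if __name__ == "__main__":
    print(f"gzip/bzip2 size ratio:   {size_ratio:.2f}")
    print(f"bunzip2 slowdown at -9:  {slowdown_best:.1f}x")
    print(f"bunzip2 slowdown at -1:  {slowdown_fast:.1f}x")
    print(f"bzip2 -1 beats gzip -9:  {fast_bzip2_beats_best_gzip}")
```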

Cheers,
Jindrich

-- 
Jindrich Novy <jnovy at redhat.com>, http://people.redhat.com/jnovy/



