Recompressing poorly compressed files
- From: aguydude@xxxxxxxxx
- Date: Thu, 4 Dec 2008 18:59:59 -0800 (PST)
Yes, you can compress already compressed files. If the initial
compression algorithm was not very good, and the 2nd one is very
good. You normally won't get a smaller file doing it this way than
just compressing with a good algorithm in the first place of course.
If I thought I could somehow compress things over and over again, I'd
be pelted with the pigeonhole principal. But starting from a poorly
compressed file and somehow getting out a smaller file (e.g. by
decompressing it and recompressing it) probably is not controversial,
particularly if you know which algorithm was used to do the initial
compression.
The idea below seems kind of obvious, but I've never seen it, so feel
free to point out where it's been done.
Let us suppose I'll come up with 65535 algorithms, as well as the null
(do nothing) algorithm. For what I'm about to discuss. Yay, all my
files are now going to be no more than an extra 2 bytes long to
implement the idea I'm about to discuss (ones for which this idea
fails also get an extra two bytes). Any time I encounter an algorithm
that sometimes manages to make a compressed file be twice as big as
the uncompressed file, I wonder why they didn't just make all their
compressed files 1 byte bigger (or one bit. They can use the other 7
bits in some cases) so they could add a flag to their algorithm to
indicate if it worked or not. Yes, I know failed compression needs to
account for things like filenames, attributes, etc, so you'll probably
increase size by more than 1 byte. But that is no excuse for doubling
your file size when your algorithm fails. :::Glares at RLE:::
Anyhow, let's go crazy and imagine that we really don't mind bloated
algorithms for compression and decompression, or even if the're a bit
slow. Though I can imagine a quick and dirty, incomplete but
practical implementation of this solution which increased the time
requirements of this algorithm by A and the memory requirements by B.
A being something under 3 * (decompress time for crappy, slow
algorithm + compress time for crappy, slow algorithm) and B being
something under the total size of GIMP and the 7-zip program + 5MB or
so.
Anyhow, the new algorithm will work as follows. Any time it detects
data that it recognizes as being compressed, it decompresses it before
running it's favorite algorithm (or trying a few if you want it to run
slower). It applies this decompression step recursively if
necessary. In the case that parts of this decompression are not
reversible (e.g. 'decompressing' a gif into a bitmap means losing some
headers), delta code is added after the header (yes, fine, we need 1
more bit to indicate if the delta is there or not, but only if the
compression works).
A practical example is this. If I have a CHM file full of gifs and
try to zip it, nothing happens. If I decompile it, replace all the
gifs with either bitmaps or pngs containing exactly the same colors at
each pixel and then I zip it, I end up with a much smaller file, even
though the data itself is sufficient to regenerate the original chm
file...or something similar enough that a delta won't be nearly as big
as this size difference. I tried this the other day with a 600KB chm
and got a working 500KB chm (it wasn't the same chm, it just looked
the same). That makes me sad. Zipping my chm file into a smaller
file should work. Zipping 20 versions of my chm file, each with
slight variations in their contents, should not fail miserably.
Anyhow, I'm sure I could've created the 600KB chm from the 500KB chm
with less than 100KB worth of information.
This algorithm is probably a bad idea for things like JPG, since
trying to find a file to use to regenerate a lossy file is probably
difficult (not necessarily impossible, but close enough). Worry about
that in version 2 :).
There probably aren't 65535 compression algorithms that are popular
enough to make it into such a program. The next step is to do the
same thing with compilers. This is probably largely impossible with
things like C++, but with interpretted languages like Python or Java
it should be at least somewhat practical. Decompiling Java code is
not really that tough. You'll probably get a similar file if you
recompile it, assuming, assuming you use the right version of Java.
Or perhaps even if you didn't. Getting this kind of thing working is
probably not possible (practically or legally) unless done by the same
people writing the compilers, particularly since you'll want it to
work properly when compiling code that isn't legal (we want to
generate bytecode for everything. We don't care if that bytecode is
actually valid, since we'll be storing a delta of that bytecode and
the one we want). But Jar files seem larger than necessary, and
there's nothing wrong with distributed the decompression tool when you
distribute the SDK. Well, except file size. Oh well :(
One could generalize this solution by having a program which
"decompiled" executables into some sort of special language, but we
already have UPX, so let's not worry about that.
It's impossible (in practice) to generalize this kind of solution to
arbitrary, unknown compression algorithms and generators, considering
that such compressed data is high entropy. But it's not impossible in
theory (the existence of an SFX which runs in reasonable time means
such an SFX can somehow be generated, if only by trying every possible
program until you reach the size of the starting data and picking the
smallest one that works, skipping anything that takes too long to
run), as long as you ignore any limits on size and power and more on
the decompression and compression program. Obviously such a program
will not be computable in any reasonable amount of time, nor will it
necessarily be the smallest possible program (IIRC, finding the
smallest possible SFX is not even computable). Hell, we can just test
every possible program until we find one at least as good as mine.
Consider Joe's really, really awful compression algorithm. It's a
10KB piece of garbage that XOR's pseudo-random numbers (using a
hardcoded seed and a pseudorandom number generator) with your files.
It also prepends the filename and algorithm name to the compressed
file. Joe claims his compression algorithm that works well since
sometimes it makes the files smaller (because he hard-coded it to make
one specific file smaller) and all the resulting files are not
compressible any further using LZMA. If you pick the right random
number generator, you can probably get this to happen. I'll assume
one can code it to be under 10KB. If not, replace JOE's algorithm
with an algorithm that zips files into zip files into zip files until
the file reaches the size of his file. Well, I can easily enough make
an SFX of JOE's compressed copy of wikipedia (i.e. running the SFX
will generate JOE's compressed file, not wikipedia) that is smaller by
just embedding his compressor, a compressed copy of wikipedia, and my
own decompressor into a file. Yay, I've compressed high-entropy data
into an SFX. But it's not random data, and my high-entropy file,
though it may be reduceable in size by the same technique, will
eventually reach a limitation.
But we don't need to generalize it. We can just pick a lot of popular
algorithms instead. And hey, this algorithm has a 2nd bonus when you
start compressing in solid blocks, since two similar pngs won't
compress into a small file.
As an aside, this technique CAN be used to generate SFXs, but such
SFXs may have too much overhead to be used for smaller files. You
only need to include the generators for the stuff that you need. If
you feel like trying this, grab a really big Gif file (or a lot of
small ones). Try to compress it. You may very well not be able to
with traditional algorithms. Convert it to bmp and compress (or just
convert to png). It will be smaller, if you picked the right gif.
Take your png and embed it into an exe as a resource or whatever.
Running the exe will convert the png to gif, fix any header
information, and save that gif to file. Congratulations, you have a
program that outputs a gif that was bigger than the file you started
with. And the Gif is from someone else.
I suppose some people would argue that rather than finding ways to
produce smaller installers when using gif files, the correct solution
is to burn all gifs. And to start using better compression
algorithms. Ah well, I suppose this idea has pretty minimal utility
in the long run, since any program that does this will slowly be
obsoleted as people stop using the crappy compression algorithms it
was designed to fix. Well, it's still useful to compress generated
files like .class files. Maybe.
:::Gets up, shields self from tomatoes:::
.
- Follow-Ups:
- Re: Recompressing poorly compressed files
- From: Thomas Richter
- Re: Recompressing poorly compressed files
- Prev by Date: Re: Suggest any topic..
- Next by Date: Re: unRar source code
- Previous by thread: unRar source code
- Next by thread: Re: Recompressing poorly compressed files
- Index(es):
Relevant Pages
|
Loading