Re: Partitioned decompression
- From: "George Johnson" <matrix29@xxxxxxxxxxx>
- Date: Sat, 8 Sep 2007 23:09:30 -0400
"gabiruh" <gabiruh@xxxxxxxxx> wrote in message
news:1189083775.122981.220660@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
| This big compressed file will be transfered between a server and the
| clients in a P2P way, so, clients requests chunks of this file to the
| server and to other clients to minimize network load on the server
| side. It works like bittorrent, the chunks are requested randomly by
| the algorithm in the clients.
| When a peer receives the chunk it decompresses it on the fly, when
| this chunk is requested by other peer, the current peer must compress
| this chunk again and send it. The process goes on til' the file is
| completly transfered among the peers.
|
| It has to be this way, starts with a big compressed file on the server
| and ends with this file decompressed on the clients.
|
| I want to know what compression algorithm fits in this scenario...
| LZW, Huffman, arithmetic coding, PPM, DMC....whathever.
Actually, any of those compression programs are fine, but since you are
in reality "streaming" the file in discreet chunks, I would just compress
those chunks (at an loss of compression ratios) as individual files and send
them in whatever compression format that the client comparison program can
decompress. This technique is non-challenging to a high degree and can be
handled with simple command line programs that you can download pretty much
easily off a GOOGLE search for "File splitters", "MD5 checksum calculation",
"command line compression decompression". Boring, easy, simple. Yawn. Oh,
and for those folks asleep by now, you need to send off the original file
name, creation date, and comments to the client from the server.
==================
So you're aiming at a compression scheme suitable for game or image
library distribution.
I can logically assume that this would want this for game update
functionality.
Since your server files are going to be a difference between the server
file and the client file, you can use a basic "Difference compression
coding" scheme. In that method, the server runs a checksum for every 100 KB
(or larger 500 KB chunks) of the file between the server and the client, if
there is a difference in checksum (preferably a MD5 checksum) between server
and client then the server sends that chunk to the client. BITTORENT works
using that scheme, but is also "Compressed File Indexing Aware"
Now, the problem with simple difference checksum coding is that if the
file is expanded rather than altered, any part of the file after the
expansion will obviously not match up on the checksum. If the file is 100
MB and even if you were to go backwards from the end of the file and find
that only 4 KB had been added in the middle somewhere in a patch for a
quickfix, the result would be tens of MB of wasted download time.
A quick fix on basic difference coding would be to run a checksum string
set backwards from the end of file as well as from the front, but again,
even a 4KB server / client file difference would result in wasted
redownloading of tens of MB.
There is a simple fix for this problem.
Server file and client file use displacement checksum difference coding.
How is this accomplished, you ask?
Server does a complete file checksum, client does a complete file
checksum. If both match, no file sending is necessary. This is the
paranoid method for making certain that both files are intact upon game
start. However it wastes computational time, so a quicker method is to run
a complete file checksum after client file download is correctly verified,
store the checksum, and when time comes on client/server compare, just
compare the most recent checksums without recalculating. This is nearly an
instant result unless the client file or server file has become corrupted by
disk errors between now and the last time. An easy fix for this is to
automatically run checksum recomputes every week as a rule or give the
client program an optional "Click to re-verify file correctness" button in
the functional interface.
Now, onto displacement checksum difference coding.
If client file is different in checksum than the server file checksum,
then the server consults its 100 KB checksum chunks which were calculated
upon the last "compile approval" of the file prior to its being moved onto
the "Client File Transfer Server".
The client file verification & update program (if it hasn't stored 100
KB checksum chunks prior to this to save client disk space) then calculates
fresh 100 KB checksum chunks.
If the file has different 100 KB checksum chunks then we resend those
chunks, but the twist is that we store 1 KB of the actual file for the
beginning and end of those 100 KB checksum chunks. Server and client share
those chunks for comparitive searches.
Visual example... (each "M" = matching 100 KB checksum chunk, each "r" =
"Resend")
SERVER & CLIENT using standard difference checksumming
MrMMMrrrMMMrrrrrMMMrMM = 10 "r" resends = 1 MB of file chunk sending
Using the above visual sample, see a new letter, "_" = Insertion.
Let's say that the above file was a graphics file and the art department
added in more object textures. Every chunk that follows a "_" would result
in resending EVERYTHING following an "_". Wasteful slow plus inefficient.
MrMrrrrrrrrrrrrrrrrrrrrrrrr = 25 chunks resent = 2.5 MB
Displacement checksum difference coding.
MrM___MMrrr_MMMrrrrrM_MMrMM = 15 chunks resent = 1.5 MB
The insertions are identified by using the 1 KB chunk begin + 1 KB chunk
end sent by the server and doing a seek compare to find the identical 1 KB
begin chunk in the client file (which also should share an indentical 1 KB
end chunk 98 KB from whereever the identical 1 KB begin chunk is found).
Within that 100 KB displacement chunk, the checksum should be identical, but
moved. The result is a loss of only 200 KB of resending for any distinct
insertion chunk and a great increase in speed for any tiny file insertion.
This can be also used to identify chunk removal too, but the method is
slightly different in that the assumption is not a file increase, but
remembers that the developers do delete stuff sometimes and allows for it.
This is more a "being awake with a pulse" thing for a programmer over
just mindlessly programming a simple program.
.
- Follow-Ups:
- Re: Partitioned decompression
- From: gabiruh
- Re: Partitioned decompression
- References:
- Partitioned decompression
- From: gabiruh
- Re: Partitioned decompression
- From: Phil Carmody
- Re: Partitioned decompression
- From: gabiruh
- Partitioned decompression
- Prev by Date: Re: Matrixview SWISH almost two times better compression then GZIP and much faster
- Next by Date: Re: Matrixview SWISH almost two times better compression then GZIP and much faster
- Previous by thread: Re: Partitioned decompression
- Next by thread: Re: Partitioned decompression
- Index(es):
Loading