Compressing List of Tuples



Perhaps this is an unusual problem. I am hoping that it is not, and
that someone who frequents this list can make some suggestions.

I have a requirement to compress large lists of 6 integer element
tuples which look like this:

1,5,11,1,2,2
1,5,11,1,2,9
1,5,11,1,4,2
1,5,11,1,4,5
1,5,11,1,4,9
1,5,11,1,5,1

Individual field in the tuples have values between 1 and 14. A typical
list might contain 50,000+ unique tuples with no list ever containing
duplicate tuples.

The catch is that I *must* use a specific (see below) compression
format. The real nature of my problem is trying to maximise the speed
and compression factor.

Compressing the above list using the required format gives:

1,5,11,1,2+4,2 can be expanded to recover original tuples
1,5,11,1,2,2 and 1,5,11,1,4,2
1,5,11,1,2+4,9 can be expanded to recover original tuples
1,5,11,1,2,9 and 1,5,11,1,4,9
1,5,11,1,4,5 could not be 'merged' with another tuple in the
input list
1,5,11,1,5,1 could not be 'merged with another tuple in the
input list

compressed tuples may contain multiple terms in each of the 6
postions. e.g.

1+2,1+2,1+2,1+2,1+2,1+2 is a compressed representation of 64
individual tuples:

1,1,1,1,1,1
1,1,1,1,1,2
1,1,1,1,2,1
1,1,1,1,2,2
..
..
2,2,2,2,2,2

it does not make sense to repeat an integer within an individual field
of a tuple: 1+2+2,1,1,1,1,1 is illegal. 1+2+14,1+2,1,1,1,1+14 is
legal.

the overriding objective is to reduce the number of *lines* in a file
of input tuples as much as possible.

anyway, it is trivial, given a list of compressed tuples, to expand
them to get the original list of tuples with no multiple terms in each
field.

what is not so trivial, is to determine an optimal list of compressed
tuples which exactly covers the original list of uncompressed tuples.

i am aware that this is analogous to the set cover problem, and also
aware of a graph matching approach which looks good on the back of an
envelope, but which is perhaps 'too hard' in real life to solve.
Regarding a set cover approach, there is also a problem in that the
set of legal compressed tuples is HUGE (consider how many possible
ways there are to create compressed tuples with between one and
fourteen elements in each of six fields:)). anyway, NP-hard either
way... so unless i'm mistaken, we're better off looking at heuristics.

i have a naive greedy solution which achieves (on average) 10x
reduction in the number of lines of tuples. however, i believe that
one should be able to do better than that.

i'd be very interested to hear any new ideas - especially ideas about
how i might prune the solution space.

PS: i should reiterate - the compression format i use is non-
negotiable because the black box system which consumes the end-result
data can accept only either uncompressed tuples or tuples compressed
using the format i described above.

.



Relevant Pages

  • Group ordering and comparison
    ... I need a little help proving a conjecture I've come up with in relation ... I am currently working on an in-memory compression patch for Linux 2.6, ... Starting at the beginning of lists L and K, ... When memory pressure is too high, I throw the LCI groups out ...
    (Linux-Kernel)
  • Re: compressing tuples in scheme
    ... I need to write a function merge-tups which given lolol produces this: ... Except for the small matter of actually writing merge-tups in Scheme! ... online help) figured out how to do rotations, mappings over lists and ... in 8-10x compression when dealing with large files of 6-tuples. ...
    (comp.lang.scheme)
  • Re: Searching Google n-gram corpus
    ... designed along the same lines but with a lot of compression. ... Well there are at least two different directions of compression: words in the n-grams and the word lists. ... x'8000'-x'bfff', and the remainder 3 byte ids from x'c00000'-x'dfffff'. ... wordid as described above, position in ngram, number of ngrams, list of ngrams ...
    (comp.databases.theory)
  • Compressing List of Tuples
    ... I have a requirement to compress large lists of 6 integer element ... The catch is that I *must* use a specific compression ... Compressing the above list using the required format gives: ... 1,5,11,1,2+4,2 can be expanded to recover original tuples ...
    (comp.compression)
  • RE: Standard Word doc to Book format
    ... We type seperate lists of acronyms, ... To create your TOC, you will have to mark, or verify that the major minor ... If your headings do not show a heading style, ... choose to apply the heading style (format menu, ...
    (microsoft.public.word.docmanagement)