Re: Compressing List of Tuples



bullockbefriending bard wrote:
I could have sworn that permuting the elements in all possible ways
would allow for more potential matching, but through a bit of testing
myself, it doesn't seem to make any difference.

glad to hear you find the same thing. i don't have a CS background and
basically don't even know what i don't know, so your confirmation is a
relief. one day i must stop using the copy of cormen i bought from
amazon as a doorstopper and start at page 1.

CLRS is tough to get into. Goodrich and Tamassia have a much more readable intro to algorithms book, though it isn't nearly as complete.


What *does* make a difference is how the columns are permuted initially,
because as you choose a particular matching, you remove a different set
of choices for the matching. In a quick test among 262,144 6-tuples of
values randomly from 0-15, it was able to reduce it down to 180,475
elements, but randomly permuting the columns and doing it again produced
180,418 elements. Another run got me 180,475 and 180,489 elements.

quite so. based on some earlier suggestions i had constructed a
modified hamming graph linking tuples which differed in only one field
and with the aid of graphviz 'discovered' partial cubes (although
didn't know their name until your recent post) and then eventually
realised that just because i could see structures on the screen didn't
mean that i had a hope in hell of coming up with an algorithm to
extract them :). i'm going to go away and think if there is some smart
way i could at least extract some information from some graph
structure which tells me something about how i should permute columns
to achieve 'optimal' compression.

If you have a reasonable method of going from your tuples to bit strings, using the method of "Approximate Dictionary Queries" by Gerth Stolting Brodal and Leszek Gasieniec from 1996, you can generate all pairs of bit strings that differ by one bit in time proportional to the total number of bits passed.*


So, if you have some time, you can permute the columns in all possible
ways, running each one of them through the algorithm, then
reverse-permute the best result. Or you can randomly permute the
columns a few times, running each through the algorithm, and pick the
best one.

estimate something like 2.8 hours runtime in python to do 720x on
extremal datasets, but may well be fast enough in STL land.
unfortunately the underlying process is time critical and 10-20 mins
would be max acceptable runtime.

So run it for 10 minutes and choose the best result :) That would get you a solid 40 runs or so.

Using a bit of sloppy math related to the harmonic numbers, after about 40 runs, it is probable that after choosing the first, you would have only found better results about 5 or 6 times, and you would only get better results from the remaining 680 possible permutations 5 or 6 times more (basically, the chance of improving a previous result after having tried i permutations already is 1/i, assuming a random permutation as input).


one again thanks for giving me food for thought.

No problem. I'm glad to help.

- Josiah

* Incidentally, I do have a method that does the same thing, is significantly easier to understand and implement, but isn't an online algorithm; you need to give it all of the bit vectors up front. It's not yet published, but if you need this operation and find the Brodal/Gasieneic algorithm to be too difficult to implement, I could likely be convinced to help.
.



Relevant Pages

  • Re: Compressing List of Tuples
    ... because as you choose a particular matching, ... but randomly permuting the columns and doing it again produced ... mean that i had a hope in hell of coming up with an algorithm to ... would be max acceptable runtime. ...
    (comp.compression)
  • String comparison
    ... I'm looking for a function (already done, not just an algorithm) that could do that: ... you give it 2 strings, it returns a percent matching: ... would give 90 % success ...
    (borland.public.delphi.non-technical)
  • Re: Open Source Karaoke?
    ... > finding pitch with certain approximation matching error. ... A diffusion based matching algorithm might work well for spectrogram ... this algorithm for spectrograms, you should, in addition to using ...
    (comp.programming)
  • Re: Language documentation ( was Re: Computing Industry shams)
    ... >> algorithm would have made the concept intuitive. ... > graph search and optimization algorithms, Python's RE matching actually IS ... In regexes, "greedy" match optimizes for the longest match, ...
    (comp.unix.programmer)
  • Re: fingerprint clssification and matching
    ... Hello frnd u can use MATLAB gabor code for fingerprint enhancement. ... I am doing a project about fingerprint classification and matching ... I want please an algorithm about this subject ...
    (comp.soft-sys.matlab)