Re: Perl hashing speedup ?



Hi,

Yitzchak Scott-Thoennes wrote:
If you are able to upgrade to a recent 5.8.x release, you may see an
improvement, if part of your problem is that hash bucket usage is
disbalanced.


I feel that disbalanced bucket usage / and less number of hash buckets
might be reason for this slow hash creation process.

Regarding this multi-level hash creation at level-1 itself we are
having 209249 unique keys.

Now I have some queries regarding way perl hashing mechanism maps keys
to buckets.
I am not too sure how many default number of buckets will be used to
map this. If number of buckets are very less then it would result in
lots of hash collisions and that will slow down entire thing ( that's
what I feel -:) )

Now can we have this info:
By default perl will be using how many buckets?
Is there any way we can increase numbers of buckets used by perl
hashes?

I got some info from http://www.perl.com/pub/a/2002/10/01/hashes.html
about how perl hashes work internally; but I guess we will be needing
information how to tweak perl hashes for this kind of requirement.

If you can see following stats while inserting initial set of keys (for
310,000 records) things were going quite fine, then there is gradual
increase for subsequent records. I feel this may be due to heavy hash
collisions. Then if you observe stats for record numbers (710000 to
770000); things were going again back on track; although right now I am
not able to confirm reason for same but I feel that this may be the
case where records were getting mapped to relatively non-collision hash
buckets. Hopefully I am thinking in right direction; if that being the
case if we can somehow increase number of hash buckets then things
should go fine.

Following are some stats; when I am creation only one level hash (using
only Key-1)
Rec Nos Time in seconds
10000 : Time: 3
20000 : Time: 2
30000 : Time: 3
40000 : Time: 3
50000 : Time: 3
60000 : Time: 3
70000 : Time: 3
80000 : Time: 3
90000 : Time: 4
100000 : Time: 3
110000 : Time: 3
120000 : Time: 3
130000 : Time: 3
140000 : Time: 2
150000 : Time: 3
160000 : Time: 3
170000 : Time: 3
180000 : Time: 3
190000 : Time: 3
200000 : Time: 3
210000 : Time: 2
220000 : Time: 4
230000 : Time: 3
240000 : Time: 3
250000 : Time: 3
260000 : Time: 4
270000 : Time: 3
280000 : Time: 4
290000 : Time: 3
300000 : Time: 4
310000 : Time: 4
320000 : Time: 5
330000 : Time: 5
340000 : Time: 5
350000 : Time: 6
360000 : Time: 6
370000 : Time: 7
380000 : Time: 7
390000 : Time: 8
400000 : Time: 8
410000 : Time: 8
420000 : Time: 11
430000 : Time: 12
440000 : Time: 12
450000 : Time: 14
460000 : Time: 14
470000 : Time: 14
480000 : Time: 14
490000 : Time: 14
500000 : Time: 17
510000 : Time: 17
520000 : Time: 20
530000 : Time: 18
540000 : Time: 21
550000 : Time: 21
560000 : Time: 22
570000 : Time: 22
580000 : Time: 22
590000 : Time: 24
600000 : Time: 26
610000 : Time: 24
620000 : Time: 26
630000 : Time: 25
640000 : Time: 27
650000 : Time: 27
660000 : Time: 27
670000 : Time: 29
680000 : Time: 30
690000 : Time: 30
700000 : Time: 32
710000 : Time: 5
720000 : Time: 3
730000 : Time: 3
740000 : Time: 3
750000 : Time: 3
760000 : Time: 2
770000 : Time: 3
780000 : Time: 17
790000 : Time: 35
800000 : Time: 38
810000 : Time: 36
820000 : Time: 38
830000 : Time: 36
840000 : Time: 39
850000 : Time: 40
860000 : Time: 40
870000 : Time: 41
880000 : Time: 42
890000 : Time: 41
900000 : Time: 41
910000 : Time: 43
920000 : Time: 43
930000 : Time: 41
940000 : Time: 43
950000 : Time: 41
960000 : Time: 46
970000 : Time: 44
980000 : Time: 45
990000 : Time: 46
1000000 : Time: 45
1010000 : Time: 43
1020000 : Time: 44
1030000 : Time: 47
1040000 : Time: 46
1050000 : Time: 44
1060000 : Time: 48
1070000 : Time: 49
1080000 : Time: 50
1090000 : Time: 50
1100000 : Time: 50
1110000 : Time: 52
1120000 : Time: 50
1130000 : Time: 47



Thanks
Sujit





Yitzchak Scott-Thoennes wrote:
sujit wrote:
This is regarding perl hashing. I am creating a 3 level hash (actually
2 such hashes at a time) with approx. 2.6 million entries.

Regarding length of each key;

Level 1 key length - Approx 40 chars
Level 2 Key length - Approx 12 chars
Level 3 key length - Approx 40 chars

Now this hash creation process itself is taking around 3.5 to 4 hrs for
me.

Is there any way to improve per default hashing function for speed-up
(or some better way to approach this requirement)? Any URL's which
could be helpful on same are welcomed -:)

Perl version information-
perl, v5.6.1 built for MSWin32-x86-multi-thread

If you are able to upgrade to a recent 5.8.x release, you may see an
improvement, if part of your problem is that hash bucket usage is
disbalanced.

Another idea would be to revert to perl4-style "emulated" multi-level
hashes; see the syntax description in

http://perldoc.perl.org/perlvar.html#%24%3B
.



Relevant Pages

  • Re: Hashing
    ... What you have to keep in mind is that a hash table for words in by ... linear testing of buckets. ... What is important is to ensure that you have an effective collision ... memory for the table, you need two arrays, one is the pointers to the ...
    (alt.lang.asm)
  • Re: Hash table in C++? STL?
    ... > a tree-like structure instead of lists for the buckets). ... > the standard will require equality or a relational comparator. ... the hashing function is where all the complexity lies and I'd be ... But we need hash only the first maxchars characters. ...
    (comp.lang.cpp)
  • Re: Perl hashing speedup ?
    ... Look at "perldoc -f keys". ... You might try pre-allocating more buckets to the hash: ... my WAG is that Perl starts with a ...
    (comp.lang.perl.moderated)
  • Re: SGI hash_map
    ... >> studied hash functions you'll find on the web. ... > And before inserting into the map, each hash value is calculated with: ... a uniform distribution. ... a program to count the number of buckets that contain multiple items. ...
    (comp.lang.cpp)
  • Re: Hash table in C++? STL?
    ... I wouldn't dare claim to be either, without doing some research first. ... by the number of items that hash to that index. ... perhaps you meant "the size measured in number of buckets" and I was ... second hash function is used to determine the sequence of slots to try. ...
    (comp.lang.cpp)