Re: C Hashmap implementation
- From: Gene <gene.ressler@xxxxxxxxx>
- Date: 26 Apr 2007 09:39:44 -0400
On Apr 23, 7:52 am, bison <Sean.Gilles...@xxxxxxxxxxxxxxx> wrote:
Hello. I'm looking into hashmap implementations for a VM based
dynamically typed language. Since the implementation must allow me to
resize on demand, I am a bit confused about the various approaches.
Here is what I have found out so far:
Almost everyone I've talked to has said that Chained Hashmaps are much
easier to implement than Open Addressed maps.
Wikipedia suggests that an approach to resizing hashmaps is to
allocate space for a newer hashmap and copy elements from to the new
table, and in some cases do it incrementally.
Quick question about the last point: I'm curious about a starting
point. How much space should a hashmap allocate initially, and when
it's full...by how much would I typically increase it? I realize
there are lots of different answers here, so a good starting point
would really help out.
You'll want to grow the table by a _factor_ greater than one so that
the amortized time spent on copying does not dominate asymptotic
performance. I have seen factors between 1.3 and 2 used in
practice.
Most hash functions work best with prime table sizes. The following
table supports growth factors of roughly 1.5 for 32-bit architectures.
unsigned primes[] = {
3, 7, 11, 19, 29, 43, 67, 97, 149, 211, 317, 479, 719, 1069, 1601,
2399,
3607, 5399, 8093, 12143, 18211, 27329, 40973, 61463, 92173,
138283,
207401, 311099, 466619, 699931, 1049891, 1574827, 2362229, 3543349,
5314961, 7972451, 11958671, 17937989, 26906981, 40360471, 60540749,
90811057,136216589, 204324851, 306487277, 459730919, 689596379,
1034394589, 1551591841, 2327387723u, 3491081599u, 4294967291u,
};
Initial size is totally application dependent. E.g. if you are likely
to have have a million tables with just a few items in them, then you
probably want to choose a very small initial value. If every table is
likely to be loaded with N >> 1 elements, starting with small tables
wastes O(N) time for copying. And, because you are allocating fresh
storage for each growth cycle, a non-compacted heap may quickly become
fragmented.
A rule of thumb is to pick a total memory size you're willing to waste
as empty table space. Call this W. Then decide the number of tables
your application is likely to have concurrently allocated. Call this
N. Then set starting size to max(1,floor(W/N)), or the nearest
prime. One possible choice for W is some modest fraction of the L2
cache, the rationale being that if you have a bunch of nearly empty
tables of size comparable to a cache line, and you hit them in random
order, you will render the cache useless. Another idea: pick W to be
a modest fraction of physical memory size. Ditto the rationale above
but for pages and swapping rather than cache lines and misses.
If map size is very unpredictable, consider self-balancing trees - red-
black or AVL. If you are using the maps to look up identifiers, you
should only look up each one once and store a Boehm number for future
refs. This renders the performance penalty of trees unimportant,
while the space saved just might recoup that penalty and more.
If you need to delete elements, then yes, chained hashes are a far
easier to implement. If you don't need deletion, open addressing with
a quadratic probe is quite simple, usually effective, and considerably
more economical. Linear probing, which makes deletion simpler, can
magnify any quirks in your hash function.
Cheers!
.
- References:
- C Hashmap implementation
- From: bison
- C Hashmap implementation
- Prev by Date: Re: C Hashmap implementation
- Next by Date: Re: 32-bit vs. 64-bit x86 Speed
- Previous by thread: Re: C Hashmap implementation
- Next by thread: Re: C Hashmap implementation
- Index(es):
Relevant Pages
|
|