Re: Implementation of pop count instruction



"Dan Koren" <dankoren@xxxxxxxxx> wrote in message
news:43390d28$1@xxxxxxxxxxxxxxxx
> "Laurent Desnogues" <l-desnogues_delme@xxxxxx> wrote in message
> news:dhav2r$lrf$1@xxxxxxxxxxxxxxxxxx
>> Dan Koren wrote:
>> [...]
>>>
>>> Take a look at http://aggregate.org/MAGIC and
>>> www.hackersdelight.org for various ideas on
>>> how to implement POPC using bit shifts and
>>> bitwise logical operations. I have recently
>>> coded and compared all the POPC algorithms
>>> presented there. Not surprisingly it turns
>>> out that the fastest one is to do byte by
>>> byte lookups through a 256-entry table,
>>> which can be easily parallelized and
>>> implemented in hardware. Just to get an
>>> idea, a double word (8 byte) POPC takes
>>> 1.67 ns on a 1.2 GHz Ultra SPARC III. I
>>> got similar results on PA-RISC: 2.08 ns
>>> at 900 MHz.
>>
>> You really should take results measured in tight loops with
>> a grain of salt when using table lookups. It often happens
>> that when your code is put in real code (as opposed to benchmark
>> loop code) the result is very different due to cache trashing,
>> even when the table is small.
>
> Hhmmm.... Do you realize the
> entire table is 256 bytes and
> fits completely in one or two
> cache lines?
>


Not to mention that this code is
executed so frequently by the
library routines that rely on
it that it stays in the cache.

When did you last check cache
sizes for large server systems?

The HP system on which I test
my libraries has 3 MB L1 cache
per chip (1.5 MB per core) and
32 MB L2 cache (shared).



dk


.



Relevant Pages

  • Re: Implementation of pop count instruction
    ... > library routines that rely on ... > it that it stays in the cache. ... > sizes for large server systems? ... I used to do absolutely everything via lookup ...
    (comp.arch)
  • Re: Implementation of pop count instruction
    ... >> library routines that rely on ... >> it that it stays in the cache. ... unless popcount is really the bottleneck of your ... > And for sure, just benchmarking popc ...
    (comp.arch)
  • Re: Implementation of pop count instruction
    ... executed so frequently by the library routines that rely on it that it stays in the cache. ... sizes for large server systems? ... And for sure, just benchmarking popc ...
    (comp.arch)