Re: Implementation of pop count instruction



Stephen Fuld wrote:

> I am not a hardware guy, and this certainly isn't a homework problem, so
> please be gentle with me. :-)
>
> I have been thinking about how a pop count instruction would be
> implemented
> in hardware. I have come up with several possibilities, but none of them
> seem "optimal".
>
> One could have a tree of adders - so for a 64 bit register, 32 two bit
> adders feeding 16 four bit adders, etc. This works, but takes six cycles,
> which seems like a lot. (maybe it doesn't matter given how infrequently
> such an instruction is executed.)
>
> One could have a look up ROM, but it would have to be pretty big to save
> any
> time. For example, with a 64K by 5 ROM, one could do it in 4 lookups and
> pipeline all but the last the add, but this still takes five cycles. If
> you could use a dual ported ROM (are there such things?), you could cut it
> down to about 3 cycles.
>
> You could just use a humongo hunk of random logic and get the answer in a
> single cycle, but the amount of logic required seems to be way to large to
> be practical.
>
> I guess if you allowed analog circuitry, you could somehow input all the
> signals into some kind of A to D converter, but that seems to way out.
>
> So, for real CPUs that have this instruction, how is it implemented?
>

For the one that I have personal knowledge of: an initial layer of (4,2)
compressors then some fast adders in a tree as you describe.

32 bits -> 8 x (4,2) compressors -> 4 x 2 bit adders ...

- Derek




.



Relevant Pages

  • Re: Implementation of pop count instruction
    ... > I am not a hardware guy, and this certainly isn't a homework problem, so ... > adders feeding 16 four bit adders, ... > pipeline all but the last the add, but this still takes five cycles. ... I assume here that an adder has 3 equal delays but thats not ...
    (comp.arch)
  • Implementation of pop count instruction
    ... I am not a hardware guy, and this certainly isn't a homework problem, so ... I have been thinking about how a pop count instruction would be implemented ... pipeline all but the last the add, but this still takes five cycles. ... could use a dual ported ROM, ...
    (comp.arch)
  • Re: Implementation of pop count instruction
    ... > I am not a hardware guy, and this certainly isn't a homework problem, so ... > I have been thinking about how a pop count instruction would be implemented ... > adders feeding 16 four bit adders, ... The most obvious idea would be a set of carry-save adders, ...
    (comp.arch)
  • Re: Implementation of pop count instruction
    ... >> I am not a hardware guy, and this certainly isn't a homework problem, so ... >> adders feeding 16 four bit adders, ... >> cycles, ... > something like 3-4 gate delays. ...
    (comp.arch)
  • Re: x86 Instruction Reference - review appreciated
    ... on a modern processor. ... only one of them can decode every possible instruction. ... there are a number of execution units, and a number of ports to feed ... on old processors the difference between peak cycles ...
    (comp.lang.asm.x86)