Re: "Larrabee" GPU design question.



On 27 Jan, 13:24, already5cho...@xxxxxxxxx wrote:
On Jan 27, 2:36 pm, n...@xxxxxxxxx wrote:





In article <oha356-apc....@xxxxxxxxxxxxxxxx>,
Bernd Paysan  <bernd.pay...@xxxxxx> wrote:

Thanks for your example - as you say, it's very like ones that are
fairly common with conventional (real number) matrices.  While
small-scale 'vector' units (e.g. SSE) can help, is that the right
way to go?

Also, with that number of cores, simulating a uniformly accessible
memory starts to have major problems (including space and memory
efficiency), so that aspect needs rethinking, too.  And that has a
major impact on other parts of the ISA.

Not necessarily, most of these operations can be done with simple loads and
stores on "special" addresses. Maybe you have short addresses that directly
map to the local memory, and long addresses that route to their destination
through a switch network.
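
A minimal sketch of that short/long address split, in Python for brevity. The 16-bit local window, the flag bit, the field layout and the names `decode` and `long_addr` are all assumptions made for illustration, not taken from any real design:

```python
LOCAL_BITS = 16                      # assumed size of the per-core local window
LOCAL_MASK = (1 << LOCAL_BITS) - 1
LONG_FLAG = 1 << 31                  # assumed marker bit for a "long" address


def long_addr(core, offset):
    """Build a long address naming a destination core and an offset."""
    return LONG_FLAG | (core << LOCAL_BITS) | (offset & LOCAL_MASK)


def decode(addr, this_core):
    """Classify an access as local or routed through the switch network.

    Short addresses (flag clear) always hit this core's local memory.
    Long addresses carry a destination core in the bits above the offset.
    """
    offset = addr & LOCAL_MASK
    if not addr & LONG_FLAG:
        return ("local", this_core, offset)
    dest = (addr & ~LONG_FLAG) >> LOCAL_BITS
    kind = "local" if dest == this_core else "network"
    return (kind, dest, offset)
```

Under these assumptions an ordinary load or store needs no new opcode: the address alone decides whether the access stays on the core or goes out through the switch.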

Maybe.  I don't agree that it would be effective, for complicated
reasons, but let's skip it as there is a simpler example of what I
meant.

A lot of the time, a compiler knows when a location may be accessed
by another thread, and when it cannot be.  As, on many designs, the
cache coherence traffic is a major problem - and I know of no way
that it isn't on 256+ cores - even halving the number of coherent
accesses would help.  Now, one can do that on a page basis, but that
doesn't match most languages' memory models - one needs it to
be controllable on an access basis.

Regards,
Nick Maclaren.

Actually, halving the number of coherent accesses *that miss the outermost
non-shared level of the cache hierarchy* would be a huge win, but I don't
think that you can get anywhere close to that goal with thread-local
memory. On the other hand, reducing the number of coherent accesses
that
a) hit the private cache
b) touch locations never accessed by other cores
even by a very significant factor systemwide buys you nothing.
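
A toy model of that argument, as a Python sketch. The cost rule (a coherent access generates coherence traffic only if it misses the private cache or touches a line other cores also use) and the made-up trace are assumptions for illustration, not measurements:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Access:
    coherent: bool       # issued as a coherent access
    hits_private: bool   # satisfied by the core's private cache
    shared: bool         # line is also touched by some other core


def coherence_messages(accesses):
    """Count accesses that generate coherence traffic in this toy model."""
    return sum(1 for a in accesses
               if a.coherent and (not a.hits_private or a.shared))


# Illustrative trace: 90 private hits, 8 private misses, 2 shared hits.
trace = ([Access(True, True, False)] * 90
         + [Access(True, False, False)] * 8
         + [Access(True, True, True)] * 2)

# Mark every access that hits privately and is never shared as non-coherent.
relaxed = [replace(a, coherent=False) if a.hits_private and not a.shared else a
           for a in trace]

# Mark only the private misses to unshared lines as non-coherent instead.
relaxed_misses = [replace(a, coherent=False)
                  if not a.hits_private and not a.shared else a
                  for a in trace]
```

In this model `trace` and `relaxed` cost the same number of messages, even though 90% of the accesses were demoted, while `relaxed_misses` removes most of the traffic by demoting only 8 accesses.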

BTW, I suspect you're way too optimistic w.r.t. 256+ cores. IMHO, even
at 32 cores Intel will run into serious cache-coherence-related troubles.
If I were an architect I'd limit coherent domains to 4-16 cores and do
explicit message passing above that level. Thinking about it, maybe even
1 core per coherent domain, but then I would want more than 4 HW
threads per core.
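
The partitioning above can be sketched as follows. Grouping consecutively numbered cores into domains, and the names `domain_of` and `transport`, are assumptions for illustration only:

```python
def domain_of(core, domain_size):
    """Coherent domain index, assuming consecutive cores share a domain."""
    return core // domain_size


def transport(src, dst, domain_size):
    """Pick the communication mechanism between two cores."""
    if domain_of(src, domain_size) == domain_of(dst, domain_size):
        return "coherent shared memory"
    return "explicit message"
```

With `domain_size` set to 1, every pair of distinct cores falls back to explicit messages, which is the "1 core per coherent domain" case.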

1 hardware thread per core, tiny cores, memory local to each core. Cores
exchange state so that execution migrates to the needed memory. A local,
super-fast stack cache (*4) (say 32 bytes) keeps a core executing while
execution migrates to the core with the needed memory. The garbage
collector takes on the role of compiler reordering for localization of
code and data to some extent, but it is not necessary for lower-efficiency
operation. Profiling reorganization... With a hierarchical switch, the
register set is about 32 clocks away from any memory in a 16GB address
space. No cache coherency, as there is no self-modifying code, and stack
overlap is ignored for even better crash avoidance. Any data stall is much
better than a coherence flush.

cheers jacko