Re: "Larrabee" GPU design question.
- From: already5chosen@xxxxxxxxx
- Date: Tue, 27 Jan 2009 05:24:07 -0800 (PST)
On Jan 27, 2:36 pm, n...@xxxxxxxxx wrote:
> In article <oha356-apc....@xxxxxxxxxxxxxxxx>,
> Bernd Paysan <bernd.pay...@xxxxxx> wrote:
> > > Thanks for your example - as you say, it's very like ones that are
> > > fairly common with conventional (real number) matrices. While
> > > small-scale 'vector' units (e.g. SSE) can help, is that the right
> > > way to go?
> > >
> > > Also, with that number of cores, simulating a uniformly accessible
> > > memory starts to have major problems (including space and memory
> > > efficiency), so that aspect needs rethinking, too. And that has a
> > > major impact on other parts of the ISA.
> >
> > Not necessarily; most of these operations can be done with simple loads and
> > stores on "special" addresses. Maybe you have short addresses that directly
> > map to the local memory, and long addresses that route to their destination
> > through a switch network.
>
> Maybe. I don't agree that it would be effective, for complicated
> reasons, but let's skip it as there is a simpler example of what I
> meant.
>
> A lot of the time, a compiler knows when a location may be accessed
> by another thread, and when it cannot be. As, on many designs, the
> cache coherence traffic is a major problem - and I know of no way
> that it isn't on 256+ cores - even halving the number of coherent
> accesses would help. Now, one can do that on a page basis, but that
> doesn't match most languages' memory models - one needs it to
> be controllable on a per-access basis.
>
> Regards,
> Nick Maclaren.

Actually, halving the number of coherent accesses *that miss the outermost
non-shared level of the cache hierarchy* would be a huge win, but I don't
think that you can get anywhere close to that goal with thread-local
memory. On the other hand, reducing the number of coherent accesses
that
a) hit the private cache, and
b) touch lines never accessed by other cores
buys you nothing, even if you cut them by a very significant factor systemwide.
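
To put toy numbers on that point, here is a back-of-envelope sketch in Python. Everything in it is an assumption for illustration: the `AccessClass` type, the access counts, and above all the cost model, which charges interconnect traffic only to coherent accesses that miss the private cache. Real protocols have more cases (e.g. upgrades on writes to shared lines), so treat it as an illustration of the argument, not a measurement.

```python
from dataclasses import dataclass


@dataclass
class AccessClass:
    count: int          # number of accesses in this class (made-up numbers)
    coherent: bool      # issued as a coherent access?
    hits_private: bool  # satisfied by the core's private cache?


def coherence_traffic(classes):
    """Toy model: only coherent accesses that miss the private cache
    generate snoop/directory messages on the interconnect."""
    return sum(c.count for c in classes if c.coherent and not c.hits_private)


# Baseline: 1000 coherent accesses, 100 of which miss the private cache.
baseline = [
    AccessClass(900, coherent=True, hits_private=True),
    AccessClass(100, coherent=True, hits_private=False),
]

# Compiler marks 800 private-cache hits as thread-local (non-coherent).
thread_local = [
    AccessClass(800, coherent=False, hits_private=True),
    AccessClass(100, coherent=True, hits_private=True),
    AccessClass(100, coherent=True, hits_private=False),
]

# Hypothetical alternative: the coherent *misses* themselves are halved.
fewer_misses = [
    AccessClass(900, coherent=True, hits_private=True),
    AccessClass(50, coherent=True, hits_private=False),
]

for name, mix in [("baseline", baseline),
                  ("thread_local", thread_local),
                  ("fewer_misses", fewer_misses)]:
    print(name, coherence_traffic(mix))
```

Under this model, removing 80% of the coherent accesses leaves the traffic unchanged, because all of the removed accesses were private-cache hits; only the third mix, which reduces the misses, moves the number.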
BTW, I suspect you're way too optimistic w.r.t. 256+ cores. IMHO, even
at 32 cores Intel will run into serious cache-coherence-related trouble.
If I were an architect, I'd limit coherent domains to 4-16 cores and do
explicit message passing above that level. Thinking about it, maybe even
1 core per coherent domain, but then I would want more than 4 HW
threads per core.
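
A minimal sketch of that partitioning, in Python for brevity. The names `domain_of` and `transport` and the contiguous core numbering are hypothetical, chosen only to illustrate the idea; nothing here describes an actual Larrabee mechanism.

```python
def domain_of(core: int, domain_size: int) -> int:
    """Coherent domain a core belongs to, assuming cores are numbered
    contiguously and grouped into fixed-size domains (an assumption)."""
    return core // domain_size


def transport(src: int, dst: int, domain_size: int) -> str:
    """How two cores would communicate under the proposed scheme:
    coherent shared memory inside a domain, explicit messages across."""
    if domain_of(src, domain_size) == domain_of(dst, domain_size):
        return "coherent shared memory"
    return "explicit message"


# 32 cores split into coherent domains of 8 cores each.
print(transport(0, 7, 8))   # same domain
print(transport(7, 8, 8))   # neighbouring cores, different domains

# Degenerate case: 1 core per coherent domain, so only the HW threads
# of a single core share memory coherently.
print(transport(0, 1, 1))
```

With `domain_size=1` every core-to-core exchange becomes an explicit message, which is why that variant only makes sense with plenty of hardware threads per core.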