Re: High-bandwidth computing interest group
- From: Andy Glew <"newsgroup at comp-arch.net">
- Date: Fri, 23 Jul 2010 17:28:11 -0700
The workday is officially over at 5pm, so I can continue the post I started at lunch. (Although I am pretty sure to get back to work this evening.)
Top quoting without deleting my previous post - so you'll have to scroll way down.
On 7/23/2010 12:01 PM, Andy Glew wrote:
On 7/21/2010 3:18 PM, George Neuner wrote:
On Tue, 20 Jul 2010 15:41:13 +0100 (BST), nmm1@xxxxxxxxx wrote:
In article<04cb46947eo6mur14842fqj45pvrqp61l1@xxxxxxx>,
George Neuner<gneuner2@xxxxxxxxxxx> wrote:
ISTM bandwidth was the whole point behind pipelined vector processors
in the older supercomputers. ...
... the staging data movement provided a lot of opportunity to
overlap with real computation.
YMMV, but I think pipeline vector units need to make a comeback.
NO chance! It's completely infeasible - they were dropped because
the vendors couldn't make them for affordable amounts of money any
longer.
Actually I'm a bit skeptical of the cost argument ... obviously it's
not feasible to make large banks of vector registers fast enough for
multiple GHz FPUs to fight over, but what about a vector FPU with a
few dedicated registers?
I have been reading this thread somewhat bemused.
To start, full disclosure: I have proposed having pipelined vector
instructions making a comeback, in my postings to this group and my
presentations, e.g. at Berkeley Parlab (linked to on some web page).
Reason: not to improve performance, but to reduce costs compared to what
is done now.
What is done now?
There are CPUs with FPUs pipelined circa 5-7 cycles deep. Commonly 2
sets of 4 32-bit SP elements wide, sometimes 8 or 16 wide.
There are GPUs with 256-1024 SP FPUs on them. I'm not so sure about
pipeline depth, but it is indicated to be deep by recommendations that
dependent ops not be closer together than 40 cycles.
The GPUs often have 16KB of registers for each group of 32 or so FPUs.
I.e. we are building systems with more FPUs, more deeply pipelined FPUs,
and more registers than the vector machines I am most familiar with,
Cray-1 era machines. I don't know by heart the specs for the last few
generations of vector machines before they died back, but I suspect that
modern CPUs and, especially, GPUs, are comparable.
Except
(1) they are not organized as vector machines, and
(2) the memory subsystems are less powerful, in proportion to the FPUs,
than in the old days.
I'm going to skate past the memory subsystems since we have talked about
this at length elsewhere, and since that will be the topic of Robert
Myers' new mailing list. Except to say (a) high end GPUs often have
memory separate from the main CPU memory, made with more expensive
GDRAMs rather than conventional DRAMs, and (b) modern DRAMs emphasize
sequential burst accesses in ways that Cray-1's SRAM based memory
subsystem did not. Basically, commodity DRAM does not lend itself to
non-unit-stride access patterns. And building a big system out of
non-commodity memory is much more prohibitive than back in the day of
the Cray-1. This was becoming evident in the last years of the old
vector processors.
But let's get back to the fact that these modern machines, with more
FPUs, more deeply pipelined, and with more registers than the classic
vector machines, are not organized as pipelined vector machines. To some
limited extent they are small parallel vector machines - operating on 4
32b SP in a given operation, in parallel in one instruction. The actual
FPU operation is pipelined. There may be a small degree of vector
pipelining, e.g. spreading an 8 element vector over 2 cycles. But not
the same degree of vector pipelining as in the old days, where a single
instruction may be pipelined over 16 or 64 cycles.
Why aren't modern CPUs and GPUs vector pipelined? I think one of the not
widely recognized things is that we are significantly better at
pipelining now than in the old days. The Cray-1 had 8 gate delays per
cycle. I suspect that one of the motivations for vectors was that it was
a challenge to decode back to back dependent instructions at that rate,
whereas it was easier to decode an instruction, set up a vector, and
then run that vector instruction for 16 to 64 cycles. Yes, arranging
chaining, and yes, I know that one of the Cray-1's claims to fame was
better scalar instruction performance.
If you can run individual scalar instructions as fast as you can run
vector instructions, giving the same FLOPS, wouldn't you rather? Why use
vectors rather than scalars?
I'll answer my own question: (a) vectors allow you to use the same
number of register bits to specify a lot more registers -
#vector-registers * #elements per vector. (b) vectors save power - you
only decode the instruction once, and the decoding and scheduling logic
gets amortized over the entire vector.
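
The two points above can be put into numbers with a minimal sketch. The
figures are assumptions chosen for illustration (a Cray-1-like layout of 8
vector registers of 64 elements behind a 3-bit register specifier), and the
function names are hypothetical:

```python
# Illustrative arithmetic only. The register counts below are assumptions
# used as an example, not measurements of any particular machine.

def addressable_elements(specifier_bits, elements_per_register):
    """Data elements reachable through one register-specifier field."""
    return (2 ** specifier_bits) * elements_per_register

def decodes_per_element(vector_length):
    """Instruction decodes charged to each element operation."""
    return 1.0 / vector_length

scalar = addressable_elements(3, 1)    # 8 scalar registers
vector = addressable_elements(3, 64)   # 8 vector registers x 64 elements
print(scalar, vector)                  # 8 512
print(decodes_per_element(1), decodes_per_element(64))  # 1.0 0.015625
```

The same 3 bits in the instruction reach 64 times as many elements, and
each decode is spread over 64 element operations instead of one.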
But if you aren't increasing the register set or adding new types of
registers, and if you aren't that worried about power, then you don't
need vectors.
But we are worried about power, aren't we?
Why aren't modern GPUs vector pipelined? Basically because they are
SIMD, or, rather, SIMD in its modern evolution of SIMT, CIMT, Coherent
Threading. This nearly always gets 4 cycles' worth of amortization of
instruction decode and schedule cost. And it seems to be easier to
program. And it promotes portability.
When I started working on GPUs, I thought, like many on this newsgroup,
that vector ISAs were easier to program than SIMD GPUs. I was quite
surprised to find out that this is NOT the case. Graphics programmers
consistently prefer the SIMD programming model. Or, rather, they
consistently prefer to have lots of little threads executing scalar or
moderate VLIW or short vector instructions, rather than fewer,
heavyweight, threads executing longer vector instructions. Partly
because their problems tend to be short vector, 4 element, rather than
long vector operations. Perhaps because SIMD is what they are familiar
with - although, again I emphasize that SIMT/CIMT is not the same as
classic Illiac-IV SIMD. I think that one of the most important aspects
is that SIMD/SIMT/CIMT code is more portable - it runs fairly well on
both GPUs and CPUs. And it runs on GPUs no matter whether the parallel
FPUs, what would be the vector FPUs, are 16 wide x 4 cycles, or 8 wide x
8 cycles, or ...
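
A toy model of that portability point, as a sketch only: the per-thread
kernel below is written once as scalar code, and a hypothetical run_warp
helper (not any real GPU API) issues a 64-thread warp either 16 lanes at a
time or 8 lanes at a time. The results are identical; only the cycle count
differs.

```python
def run_warp(kernel, inputs, lanes):
    """Run one per-thread scalar kernel over a warp, `lanes` threads per cycle.

    Returns (results, cycles). This models issue width only; it is an
    assumption-laden toy, not a description of real hardware.
    """
    results = []
    cycles = 0
    for start in range(0, len(inputs), lanes):
        group = inputs[start:start + lanes]
        results.extend(kernel(x) for x in group)  # same instruction, many lanes
        cycles += 1
    return results, cycles

def saxpy_thread(x, a=2.0, y=1.0):
    """What one thread computes; no vector width appears in the code."""
    return a * x + y

warp = [float(i) for i in range(64)]
wide, wide_cycles = run_warp(saxpy_thread, warp, lanes=16)
narrow, narrow_cycles = run_warp(saxpy_thread, warp, lanes=8)
print(wide == narrow)              # True
print(wide_cycles, narrow_cycles)  # 4 8
```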
Continuing the discussion of the advantages of vector instruction sets and hardware.
Vector ISAs allow you to have a whole lot of registers accessible from relatively small register numbers in the instruction. GPU SIMD/SIMT/CIMT get the same effect by having a whole lot of threads, each given a variable number of registers. Basically, reducing the number of registers allocated to threads (which run in warps or wavefronts, say 16 wide over 4 cycles) is equivalent to, and probably better than, having a variable vector length - variable on a per vector register basis. I'm not aware of many classic vector ISAs doing this - and if they did, they would lose the next advantage.
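
A rough sketch of that trade-off, using the 16KB register file figure
mentioned earlier in this post. The 32-bit register size and the
resident_threads name are assumptions for illustration, and real GPUs have
other limits on thread count that this ignores:

```python
REGISTER_FILE_BYTES = 16 * 1024   # figure from the post: ~16KB per group of FPUs
BYTES_PER_REGISTER = 4            # assumption: 32-bit registers

def resident_threads(registers_per_thread):
    """Threads that fit if the register file is the only limit (hypothetical)."""
    total_registers = REGISTER_FILE_BYTES // BYTES_PER_REGISTER
    return total_registers // registers_per_thread

for regs in (8, 16, 32, 64):
    # Fewer registers per thread -> more threads in flight, much like trading
    # the number of vector registers against vector length.
    print(regs, resident_threads(regs))
```

With these assumed sizes, halving the registers given to each thread
doubles the number of threads the same register file can hold.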
Vector register files can be cheaper than ordinary register files. Instead of having to allow any register to be accessed, vector ISAs allow you to only have to index the first element of a vector fast; subsequent elements can stream along with greater latency. However, I'm not aware of any recent vector hardware uarch that has taken advantage of this possibility. Usually they build just a great big wide register file.
Vector ISAs are on a slippery slope of ISA complexity. First you have vector+vector -> vector ops. Then you add vector sum reductions. Inner products. Prefix calculations. Operate under mask. Etc. This slippery slope seems much less slippery for CIMT, since most of these operations can be synthesized simply out of the scalar operations that are their basis.
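
To illustrate the synthesis point, here is a minimal sketch in which a sum
reduction and a prefix sum are built from nothing but a per-lane scalar add
and a per-lane predicate. The lock-step loop over lanes stands in for the
hardware running all lanes together; the function names are hypothetical and
the schemes used (a tree reduction and a Hillis-Steele style scan) are one
possible choice, not the only one.

```python
def simt_sum(values):
    """Tree reduction; every lane runs the same scalar add.

    Assumes len(values) is a power of two.
    """
    vals = list(values)
    n = len(vals)
    stride = 1
    while stride < n:
        old = list(vals)                   # lanes read before anyone writes
        for lane in range(n):              # conceptually parallel lanes
            if lane % (2 * stride) == 0:   # per-lane predicate
                vals[lane] = old[lane] + old[lane + stride]
        stride *= 2
    return vals[0]

def simt_prefix_sum(values):
    """Inclusive prefix sum built from the same scalar add."""
    vals = list(values)
    n = len(vals)
    stride = 1
    while stride < n:
        old = list(vals)
        for lane in range(n):
            if lane >= stride:
                vals[lane] = old[lane] + old[lane - stride]
        stride *= 2
    return vals

print(simt_sum([1, 2, 3, 4, 5, 6, 7, 8]))         # 36
print(simt_prefix_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 3, 6, 10, 15, 21, 28, 36]
```

No reduction or scan instruction is needed in the ISA; the predicate plays
the role of "operate under mask".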
Vector chaining is a source of performance - and complexity. It happens somewhat for free with Nvidia style scalar SIMT, and the equivalent of more complicated chaining complexes can be set up using ATI/AMD's VLIW SIMT.
All this being said, why would I be interested in reviving vector ISAs?
Mainly because vector ISAs allow the cost of instruction decode and scheduling to be amortized.
But also because, as I discussed in my Berkeley Parlab presentation of Aug 2009 on GPUs, I can see how to use vector ISAs to ameliorate somewhat the deficiencies of coherent threading, specifically the problem of divergence.