Re: avoiding glTranslatef*
- From: aku ankka <jukka@xxxxxxxxxxxx>
- Date: Tue, 4 Dec 2007 23:26:53 -0800 (PST)
On Dec 5, 6:53 am, aku ankka <ju...@xxxxxxxxxxxx> wrote:
> I didn't say it takes single vertices. I just said that a specific vertex
> attribute can be sent to the GPU once for many vertices.
No, it doesn't work that way. They don't have RLE compression or
anything similar. Each vertex is a unique entity that costs memory
footprint and bandwidth to transfer, sorry to burst the bubble. :)
On the brighter side, that would be a great idea if the hardware were
so simple that "vertex" was an internal hardware-level command.
Some on-chip registers would retain the fixed-function state and use
it to generate output for the rasterizer. A glColor*() call would
generate a "color" command in the command stream, and this command
would write into the current color register. And so on.
This is how it used to be. In that kind of hardware it would make
sense. But let's look at the command layout; I'll try to minimize the
size:
The format is:
uint8 command;
uint8 data[...]
If we have fixed-size commands, we need at least 17 bytes per
command (command ID + GLfloat[4]), so even a simple glColor4ub, which
only has 4 bytes of data payload, would consume 16 bytes worth of data
slots. That's inefficient for bus utilization. Let's "compress" the
commands: the data size is command specific. This means the parsing
logic has to be bigger; at its simplest it could be a table, where each
command ID has a table entry:
struct cmdInfo
{
    uint32 func; // address or offset of command handler microcode
    uint8 size;  // data size
};
The 8-bit command ID would then be used to index the cmdInfo table;
this assumes a programmable CP (command processor). The command ID
would be 8 bits, but, say, only the 6 LSBs would be significant, so the
table size could be reduced. There might not be a CP at all, with
everything fixed; that would also work, no problem.
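The table lookup described above can be sketched in C. This is only an illustration under stated assumptions: the command IDs (CMD_COLOR4UB, CMD_NORMAL3F), the GpuState registers and run_commands() are made-up names, and a C function pointer stands in for the "address or offset of command handler microcode".

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical command IDs; real hardware defines its own encoding. */
enum { CMD_COLOR4UB = 0, CMD_NORMAL3F = 1, CMD_COUNT = 2 };

/* Stand-in for the on-chip "current state" registers. */
typedef struct {
    uint8_t color[4];
    float   normal[3];
} GpuState;

typedef void (*CmdHandler)(GpuState *state, const uint8_t *data);

typedef struct {
    CmdHandler func; /* stands in for the microcode address */
    uint8_t    size; /* payload size in bytes */
} CmdInfo;

static void handle_color4ub(GpuState *s, const uint8_t *d)
{
    memcpy(s->color, d, sizeof s->color);
}

static void handle_normal3f(GpuState *s, const uint8_t *d)
{
    memcpy(s->normal, d, sizeof s->normal);
}

static const CmdInfo cmd_table[CMD_COUNT] = {
    [CMD_COLOR4UB] = { handle_color4ub, 4 },
    [CMD_NORMAL3F] = { handle_normal3f, 3 * sizeof(float) },
};

/* Walks a variable-size command stream.
 * Returns the number of commands executed, or -1 if the stream
 * contains an unknown ID or a truncated payload. */
static int run_commands(GpuState *state, const uint8_t *stream, size_t len)
{
    size_t pos = 0;
    int executed = 0;

    while (pos < len) {
        uint8_t id = stream[pos++] & 0x3F; /* only the 6 LSBs are significant */
        if (id >= CMD_COUNT)
            return -1;
        const CmdInfo *info = &cmd_table[id];
        if (len - pos < info->size)
            return -1;
        info->func(state, stream + pos);
        pos += info->size;
        executed++;
    }
    return executed;
}
```

The parser never needs to know what a command means, it only needs the size from the table to find where the next command ID starts.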
OK, so the command payload for glColor4ub would be:
uint8 command; // = "color(4,ubyte)"
uint8 data[4];
glNormal3f would be:
uint8 command; // = "normal(3,float)"
float data[3];
And so on. We notice a trend: for glColor4ub the command overhead
would be 20% (1 byte out of 5), for glNormal3f about 7.7% (1 byte out
of 13), and so on.
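The arithmetic can be checked with a few lines of C. This is just a sketch assuming the 1-byte command ID and the 16-byte fixed data slot used in this post; id_overhead() and fixed_waste() are made-up helper names.

```c
#include <stddef.h>

/* Fraction of a variable-size command occupied by the 1-byte command ID. */
static double id_overhead(size_t payload_bytes)
{
    return 1.0 / (double)(1 + payload_bytes);
}

/* Fraction of a fixed 17-byte command (1-byte ID + 16-byte slot, room
 * for GLfloat[4]) that carries no payload. Assumes payload_bytes <= 16. */
static double fixed_waste(size_t payload_bytes)
{
    const size_t slot = 16;
    return (double)(1 + slot - payload_bytes) / (double)(1 + slot);
}
```

With these, id_overhead(4) gives 0.20 for glColor4ub and id_overhead(12) gives about 0.077 for glNormal3f, while fixed_waste(4) shows that a glColor4ub in the fixed-size layout would spend 13 of its 17 bytes (about 76%) on something other than payload.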
The modern 3D chip has hundreds of ALUs and a _LOT_ of processing
power. What it needs is more memory bandwidth, even local bandwidth.
Samsung is going to mass produce GDDR5 memory in Q1/08 and even that
isn't enough yet, but it's an improvement still. :) Even worse, in this
arrangement each command comes from the *client* system, which means it
has to go through a bus such as AGP or PCI Express, which has even
lower bandwidth than the graphics card's local memory.
We could assume UMA, where all memory is local, but that's even WORSE:
the same memory is shared by the host CPU and the graphics processor,
which means there is even LESS available bandwidth. Also, it is
trickier to organize memory into banks that can be read simultaneously.
It's easy to issue hundreds of memory read requests to a controller but
more difficult to handle them all with only one memory bus... you'll be
limited and that's that. If you ditch UMA in favor of dedicated buses
for different units, it's more expensive to add routes. If the memory
is off-chip you need more pins; if the memory is on-chip it takes
gates to implement: bigger chip, smaller yield, more expensive end
product. It also makes the chip more power-hungry, makes it run
hotter and so on.
You have to design for a balanced system; it doesn't pay off to have
one part so fast that the other components cannot satisfy its
requirements (like memory bandwidth).
So memory bandwidth plays a critical role, and we want to save it. A
few percent of saved memory bandwidth translates into roughly the same
few percent faster rendering, pretty much linearly, because bandwidth
is the limiting factor. So we want instead to have arrays of data
without the per-command overhead, and it doesn't really make sense to
have the "immediate legacy mode" logic on-chip. We go with the design
that is more efficient instead. This means immediate mode is
implemented in such a way that it feeds the improved render path. It
becomes a proxy!
As a proxy, it doesn't fulfill any useful function from the driver's
point of view. The developer's point of view is that it is
*convenient*; with this kind of thinking it can easily be implemented
as an external library. It does the SAME THING the driver is doing,
except it can be made *more* efficient as it doesn't have to check for
things like "do we have a current context, if so direct the data there,
yada yada yada..."
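To make the "proxy as an external library" idea concrete, here is a minimal sketch in C. Everything in it (ImVertex, ImBatch, the im_* functions, the fixed capacity) is a made-up illustration, not any driver's or library's actual API. The calls only latch state and append complete vertices to an array; a real shim would hand that array to the array-based path (for example through vertex arrays and glDrawArrays) in im_end().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One complete vertex: every attribute is stored per vertex. */
typedef struct {
    float   pos[3];
    uint8_t color[4];
} ImVertex;

#define IM_MAX_VERTICES 1024 /* arbitrary capacity for the sketch */

typedef struct {
    uint8_t  current_color[4];          /* latched by im_color4ub() */
    ImVertex vertices[IM_MAX_VERTICES];
    size_t   count;
} ImBatch;

/* Starts a new batch; the current color is kept, as in immediate mode. */
static void im_begin(ImBatch *b)
{
    b->count = 0;
}

/* Only updates the "current color"; nothing is emitted yet. */
static void im_color4ub(ImBatch *b, uint8_t r, uint8_t g, uint8_t bl, uint8_t a)
{
    b->current_color[0] = r;
    b->current_color[1] = g;
    b->current_color[2] = bl;
    b->current_color[3] = a;
}

/* Emits a full vertex, copying the current color into it.
 * Returns 0 on success, -1 if the batch is full. */
static int im_vertex3f(ImBatch *b, float x, float y, float z)
{
    if (b->count == IM_MAX_VERTICES)
        return -1;
    ImVertex *v = &b->vertices[b->count++];
    v->pos[0] = x;
    v->pos[1] = y;
    v->pos[2] = z;
    memcpy(v->color, b->current_color, sizeof v->color);
    return 0;
}

/* Finishes the batch: exposes the array that would be submitted in one
 * go and returns the number of vertices in it. */
static size_t im_end(const ImBatch *b, const ImVertex **out)
{
    *out = b->vertices;
    return b->count;
}
```

Note that a color set once and followed by many vertices still ends up stored once per vertex in the array: the convenience is on the API side only, the data that gets transferred is a plain array of full vertices.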
FWIW, 8 bits, "bytes" and so on were used only as examples; in hardware
we happily use 3-bit, 5-bit, 11-bit, etc. data types, buses and data
paths all the time. It's usually the client data that comes in
multiples of 8 bits. Since the command stream is often built in the
driver running on the CPU, the commands have distinct granularities,
say, multiples of 32 bits, or multiples which depend on a lot of
factors. Some systems do only aligned memory reads in 512-bit blocks,
so it makes sense to make use of that, or to organize data so that
the issue becomes moot (e.g. "big packets" where a little waste is
insignificant) and so on and on... you can easily drown in the
details; "where to begin to explain" can be a bit of a problem.
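As a small illustration of the granularity point, this is how padding a command to a fixed multiple could look in C. It is only a sketch: pad_to() is a made-up helper, and it assumes the granule is a power of two (4 bytes for 32-bit words, 64 bytes for 512-bit blocks).

```c
#include <stddef.h>

/* Rounds size up to the next multiple of granule.
 * granule must be a power of two. */
static size_t pad_to(size_t size, size_t granule)
{
    return (size + granule - 1) & ~(granule - 1);
}
```

With 32-bit granularity the 5-byte color command from the example above grows to pad_to(5, 4) == 8 bytes and the 13-byte normal command to pad_to(13, 4) == 16 bytes, so the padding can cost more than the command ID itself.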