Re: rewrite glRotate?



On Oct 30, 8:34 pm, Krzysiek So?ek
<ksolek191.USU...@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
Wolfgang Draxinger wrote:

Also all matrix functions of OpenGL are not HW accelerated
anyway, so you can actually gain some performance if you do
reimplement them yourself.

Hello,
I don't know if I understand this correctly.
Does that mean that a 32-bit CPU performs matrix multiplication
faster than a 256-bit GPU?

Regards
Krzysiek

It doesn't matter as long as the GPU does not collapse the matrix
stack.

The strength of a GPU is parallelism; you have hundreds of
multiply-accumulate units running in parallel. That's great, but
collapsing the matrix stack isn't a task that benefits much from such
an arrangement.

A GPU's parallelism pays off when you have thousands or millions of
vertices and fragments that all execute the same program. Setup work
like collapsing the matrix stack is a once-per-draw-command overhead.
Transferring the whole matrix stack, plus the commands saying what to
do with it, into the GPU's local memory only to do a few
multiply-accumulate ops isn't really worth the hassle.
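
To make "collapsing the matrix stack" concrete, here is a minimal CPU-side sketch in C. It assumes that collapsing means multiplying an ordered list of 4x4 matrices into a single matrix once per draw call, and that matrices are stored column-major as OpenGL expects. The names mat4, mat4_mul, mat4_identity and collapse_stack are hypothetical helpers for illustration, not part of OpenGL or any driver.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major 4x4 matrix, the layout OpenGL uses: m[col * 4 + row]. */
typedef struct { float m[16]; } mat4;

/* Hypothetical helper: returns the identity matrix. */
static mat4 mat4_identity(void)
{
    mat4 r = {{0.0f}};
    r.m[0] = r.m[5] = r.m[10] = r.m[15] = 1.0f;
    return r;
}

/* Hypothetical helper: returns a * b (64 multiplies, 48 adds). */
static mat4 mat4_mul(const mat4 *a, const mat4 *b)
{
    mat4 out;
    for (int col = 0; col < 4; ++col) {
        for (int row = 0; row < 4; ++row) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += a->m[k * 4 + row] * b->m[col * 4 + k];
            out.m[col * 4 + row] = sum;
        }
    }
    return out;
}

/* Collapse stack[0] * stack[1] * ... * stack[count - 1] into one matrix.
 * Each step needs the previous result, so the chain is inherently serial. */
static mat4 collapse_stack(const mat4 *stack, size_t count)
{
    assert(stack != NULL || count == 0);
    mat4 result = mat4_identity();
    for (size_t i = 0; i < count; ++i)
        result = mat4_mul(&result, &stack[i]);
    return result;
}
```

Even with a handful of matrices this is only a few hundred floating-point operations per draw call, which is the "few multiply-accumulate ops" mentioned above.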

The modern x86 architecture has SSE instructions, which use 128-bit
(xmm) registers. The 32 bits you mention give me a mental image of the
x87 floating-point stack. The 256 bits, on the other hand, leave me
drawing a blank: most GPUs process either 32-bit scalars (scalar
engine) or 4 x 32-bit scalars (vector engine) at a time. My best guess
is that you mean that some specific GPU has a 256-bit-wide internal
bus..?
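
As an illustration of what one 128-bit xmm register buys you, here is a rough sketch of a 4x4 matrix times vec4 product where each SSE instruction works on four 32-bit floats at once. The function name mat4_mul_vec4 and the HAVE_SSE macro are made up for this example; the intrinsics come from the standard xmmintrin.h header. A plain scalar fallback is included for targets without SSE, and the matrix is assumed to be column-major.

```c
#include <assert.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* out = M * v, with M column-major: m[col * 4 + row]. */
static void mat4_mul_vec4(const float m[16], const float v[4], float out[4])
{
    assert(m != 0 && v != 0 && out != 0);
#if HAVE_SSE
    /* Each column is one 128-bit register holding 4 x 32-bit floats. */
    __m128 r = _mm_mul_ps(_mm_loadu_ps(m + 0), _mm_set1_ps(v[0]));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 4),  _mm_set1_ps(v[1])));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 8),  _mm_set1_ps(v[2])));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 12), _mm_set1_ps(v[3])));
    _mm_storeu_ps(out, r);
#else
    /* Scalar fallback: same arithmetic, one float at a time. */
    float tmp[4];
    for (int row = 0; row < 4; ++row)
        tmp[row] = m[0 + row] * v[0] + m[4 + row] * v[1]
                 + m[8 + row] * v[2] + m[12 + row] * v[3];
    for (int row = 0; row < 4; ++row)
        out[row] = tmp[row];
#endif
}
```

The SSE path needs four multiplies and three adds in total, against sixteen multiplies and twelve adds in the scalar path.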

Scalar engines are becoming more common. There were a few years when
"SIMD" was all the rage in GPU design, but that is wearing off. The
data applications send for processing is often vec3 (position) and
vec2 (texcoords), and generating optimal code so that all four scalar
elements of a vec4 are utilized is not an easy task for a compiler.
By "not easy" I mean impossible for a lot of the code that developers
actually write.

It would have been more, let's say, inviting for developers to
"optimize" code for SIMD if GLSL had offered only vec4. That wasn't
the case: float, vec2 and vec3 were exposed as well, because those are
basic data types that applications have used since the stone age.

So what the hardware guys did was build a scalar engine instead, where
a crossbar distributes the computations to scalar units. A vec3+vec3
operation consumes precisely _3_ scalar units. No waste. On a vector
engine the same operation consumes one whole vector unit and wastes
25% of it, unless some other scalar add can be fused with the
vec3+vec3. But then the data has to be swizzled so that it ends up in
the same register before the operation, and so on. This kind of thing
can get really nasty and increase the number of ALU instructions
generated, which degrades performance again. So on a vector engine
it's lose-lose whichever approach you take.
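
The 25% figure is just lane arithmetic, which this toy model spells out. It assumes a 4-wide vector unit with no fusing or packing of unrelated operations, and a scalar design that hands each component to its own unit. Both function names are invented for the example and do not describe any particular GPU.

```c
#include <assert.h>

/* Percentage of a 4-wide vector unit left idle by an operation that
 * only has `components` (1..4) useful lanes, with no packing. */
static unsigned vec4_idle_percent(unsigned components)
{
    assert(components >= 1u && components <= 4u);
    return (4u - components) * 100u / 4u;
}

/* Scalar units occupied by the same operation on a scalar engine:
 * one unit per component, none idle. */
static unsigned scalar_units_used(unsigned components)
{
    assert(components >= 1u && components <= 4u);
    return components;
}
```

For a vec3 add, vec4_idle_percent(3) gives 25 and scalar_units_used(3) gives 3; for a vec2 texcoord operation the vector unit is already half idle.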

A scalar engine, on the other hand, is always optimal in this respect.
The downside is that you need extra logic to implement the crossbar,
but the idea is that this extra logic puts the units to better use, so
you get a better return on the investment. This is where "unified
shaders" come into the picture: since all computation goes into this
scalar ALU array, the fragment and vertex programs are just interfaces
to it. The problem is that a dedicated fragment ALU could be tighter
in implementation, so the generic unified ALU design is slightly
wasteful from that point of view. But when you throw enough power at
it, at least all of it can be utilized.

So from this angle the "256 bits wide" statement (whatever you meant
by it) falls apart. What actually competes is 32-bit scalar against
32-bit scalar or 128-bit (4 x 32-bit scalar) vector processing at
best. The GPU simply has a lot more ALUs to do the computations; it
wins every time in parallel computation.

The problem with a general-purpose CPU is the issue rate: processing
is serial, and each instruction depends on the state the processor was
left in by the instructions before it. You are limited by the rate at
which you can feed the CPU work to do. From a practical point of view,
the GPU doesn't have this bottleneck.

Back to the matrix stack collapse: it is not a problem you can easily
parallelize. It is serial, much like a fragment program running for
one specific fragment, except that there is never more than one
instance of it executing. From that point of view the CPU suddenly
isn't at much of a disadvantage at all. Also, doing this on the CPU
keeps the GPU side of the driver code a bit simpler. Simpler without
any performance disadvantage is a good thing.
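
Since the thread started from rewriting glRotate, here is a rough sketch of building that rotation matrix on the CPU yourself. It fills a column-major array with the axis-angle matrix given in the glRotate reference page. To keep the sketch free of library dependencies it assumes the caller has already normalized the axis and computed the cosine and sine of the angle (glRotate takes degrees, so convert to radians first). The function name rotation_matrix is made up for this example.

```c
#include <assert.h>

/* Fill out[16] (column-major, out[col * 4 + row]) with a rotation about
 * the unit-length axis (x, y, z). c and s are the cosine and sine of
 * the rotation angle, computed by the caller. */
static void rotation_matrix(float c, float s,
                            float x, float y, float z,
                            float out[16])
{
    assert(out != 0);
    const float t = 1.0f - c;

    /* First column */
    out[0] = x * x * t + c;
    out[1] = y * x * t + z * s;
    out[2] = z * x * t - y * s;
    out[3] = 0.0f;

    /* Second column */
    out[4] = x * y * t - z * s;
    out[5] = y * y * t + c;
    out[6] = z * y * t + x * s;
    out[7] = 0.0f;

    /* Third column */
    out[8]  = x * z * t + y * s;
    out[9]  = y * z * t - x * s;
    out[10] = z * z * t + c;
    out[11] = 0.0f;

    /* Fourth column: no translation */
    out[12] = 0.0f;
    out[13] = 0.0f;
    out[14] = 0.0f;
    out[15] = 1.0f;
}
```

The result can be multiplied into your own matrix and handed to the GL in one go, for example with glLoadMatrixf or glMultMatrixf, instead of calling glRotatef.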

Hope this gives the right mental image of the situation. Good luck.
