Re: the pipeline of C64x+



joggingsong@xxxxxxxxx wrote:
> Hi,all
> I am a DSP engineer for a few years. But till recently
> I have a chance of programming on C64x+ from TI.
> One difference of C64x+ from other DSPs I have used
> is the pipeline. The execution stage of pipeline is
> variable length depending on instruction type.

Well - Yes and no.

The pipeline executes one VLIW instruction package per cycle. Each of these packages may contain one instruction for each of the eight execution units. Issuing multi-cycle instructions inside such a package is not a problem: you can still issue one instruction per cycle to a given unit, even while that unit is still working on an earlier instruction (with the exception of writes to the register file).

All you see as a programmer is that the result appears later in the destination register. The execution units are pipelined, so they never block.
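
To make the "result appears later" point concrete, here is a toy model in C. It is only a sketch and not cycle-accurate: MulUnit, unit_init, unit_cycle and the LATENCY value are names and numbers I made up for illustration (LATENCY 2 assumes a multiply with one delay slot). It models a single pipelined unit that accepts a new multiply every cycle while earlier ones are still in flight.

```c
#include <string.h>

#define LATENCY  2   /* assumption: result is visible 2 cycles after issue */
#define NUM_REGS 4   /* tiny hypothetical register file */

typedef struct { int valid, dst, value; } InFlight;

typedef struct {
    InFlight stage[LATENCY];  /* stage[0] writes back at the end of the cycle */
    int reg[NUM_REGS];
} MulUnit;

void unit_init(MulUnit *u) { memset(u, 0, sizeof *u); }

/* Simulate one cycle. If issue is non-zero, a new multiply a*b -> reg[dst]
 * enters the unit. The unit never refuses an issue: older operations just
 * move one stage closer to write-back. */
void unit_cycle(MulUnit *u, int issue, int dst, int a, int b)
{
    int i;

    /* Operations already in flight advance by one stage. */
    for (i = 0; i + 1 < LATENCY; i++)
        u->stage[i] = u->stage[i + 1];

    /* The newly issued operation enters at the back of the pipeline. */
    u->stage[LATENCY - 1].valid = issue;
    u->stage[LATENCY - 1].dst   = dst;
    u->stage[LATENCY - 1].value = a * b;

    /* The operation at the front writes its destination register. */
    if (u->stage[0].valid)
        u->reg[u->stage[0].dst] = u->stage[0].value;
}
```

If you issue 3*4 into reg[0] in one cycle and 5*6 into reg[1] in the next, reg[0] still holds its old value after the first call and only changes after the second, while the second multiply was accepted without any stall. Reading reg[0] too early simply gives you the old value, which is what the real hardware does as well.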


> This kind of execution stage of pipeline is more similar to
> the execution stage of pipeline of x86.

Hm. I think they are very different. On the C64x+ there is no out-of-order execution going on. The DSP does exactly what you wrote in the code, and it will punish you in subtle ways if you ignore one or another scheduling rule. IIRC the DSP has only two ways to stall execution, and each of them stalls the *entire* core. One is a cache miss, and the other is a register read over a crossbar from a register that was written one cycle earlier.

OTOH, the inside of an x86 does not even remotely look like its instruction set. x86 instructions are decoded into micro-ops and executed by a parallel, RISC-like execution engine. The latency of an x86 instruction or the execution speed of a loop is not deterministic on x86, while it is deterministic on the C64x+.


> In my opinion, the pipeline of C64x+ makes it difficult to
> programming in assembly code.

I second that. I had my share of assembler coding on the C64x+ and it's a lot of work. If you want performance that you can't get with C, be prepared to spend hours staring at the code and finding ways to be better than the compiler. Even for simple things. The architecture hasn't been designed with hand-coding in mind.

Usually you write your code in C or C++ and let the compiler do the hard work. Assembler knowledge is still recommended though. The compiler can generate assembler output with annotations and scheduling statistics for each loop. If you want good speed you have to take a look at that output, find resource bottlenecks and sometimes rearrange the code to use a different execution unit. Understanding what's going on is essential here.


> If assembly code is needed, linear assembly is recommended.
> Can I have your opinion about this?

If you need assembler, then use the linear assembler. It makes life much easier compared to raw assembler, but it's still a PITA.

Besides productivity there are three good reasons to stay away from assembler as much as possible on that architecture:


1. It's easy to tune the C-code. Use the intrinsics for all those operations that can't be expressed in C and read the comments in the assembler output. With a bit of practice you should be able to utilize nearly all of the execution units that way.

2. The performance of a loop is often not limited by CPU cycles, but by cache misses and memory bandwidth. When you start optimizing, it's a good idea to first replace the loop with a very simple dummy loop that has the same memory access pattern (adding all the stuff together or so). Benchmark that dummy loop, divide by the number of iterations and you get an estimate of your cycles-per-iteration budget.

Once you have that budget take a look at the compiler output. You will most likely find out that C with some intrinsics and sometimes a bit of resource-balancing will get you to this mark with ease. Optimizing beyond this point makes no sense except for the fun of it. The loop will not execute any cycle faster.


3. Some loop constructs have an influence on the interrupt latency, as they disable interrupts for a while. The compiler knows about this, and if you want to change the latency all you have to do is pass a command-line option. That won't work with hand-written assembler.

Now imagine that you're halfway through your optimizations. The door opens, a hardware-guy enters your room and tells you that you have to lower the latency by 1000 cycles because some piece of external hardware starts to behave funky. With C you do a recompile, with ASM you have to touch each and every loop you wrote. :-)
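
Points 1 and 2 can be sketched together in C. Take this as a rough illustration under some assumptions, not a recipe: the function names (dot_kernel, dummy_kernel, ticks_per_iteration, load_pair, dotp2) are made up for this post, the 16-bit dot product is just a stand-in for your real loop, and I assume the TI compiler with its _dotp2 intrinsic and the TSCL cycle counter from c6x.h. Anywhere else the sketch falls back to plain C and clock(), whose ticks are not CPU cycles.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#ifdef _TMS320C6X
#include <c6x.h>   /* assumption: TI compiler, provides intrinsics and TSCL */
static void     timer_start(void) { TSCL = 0; }  /* a write starts the counter */
static uint32_t timer_read(void)  { return TSCL; }
static int32_t  dotp2(uint32_t a, uint32_t b) { return _dotp2((int)a, (int)b); }
#else
/* Host fallback so the sketch compiles anywhere. */
static void     timer_start(void) { }
static uint32_t timer_read(void)  { return (uint32_t)clock(); }
static int32_t  dotp2(uint32_t a, uint32_t b)
{
    /* Packed 16x16 dot product: low*low + high*high. */
    int16_t alo = (int16_t)(a & 0xFFFFu), ahi = (int16_t)(a >> 16);
    int16_t blo = (int16_t)(b & 0xFFFFu), bhi = (int16_t)(b >> 16);
    return (int32_t)alo * blo + (int32_t)ahi * bhi;
}
#endif

/* Load two adjacent 16-bit samples as one 32-bit word. */
static uint32_t load_pair(const int16_t *p)
{
    uint32_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

/* "Real" kernel: dot product, two samples per intrinsic call (n must be even). */
int32_t dot_kernel(const int16_t *x, const int16_t *y, size_t n)
{
    int32_t acc = 0;
    size_t i;
    for (i = 0; i + 1 < n; i += 2)
        acc += dotp2(load_pair(x + i), load_pair(y + i));
    return acc;
}

/* Dummy loop: reads the same memory in the same order, but only adds it up. */
int32_t dummy_kernel(const int16_t *x, const int16_t *y, size_t n)
{
    int32_t acc = 0;
    size_t i;
    for (i = 0; i < n; i++)
        acc += x[i] + y[i];
    return acc;
}

/* Time one call and divide by the iteration count. The result is stored
 * through 'result' so the compiler cannot drop the loop. */
double ticks_per_iteration(int32_t (*fn)(const int16_t *, const int16_t *, size_t),
                           const int16_t *x, const int16_t *y, size_t n,
                           int32_t *result)
{
    uint32_t t0, t1;
    timer_start();
    t0 = timer_read();
    *result = fn(x, y, n);
    t1 = timer_read();
    return (double)(uint32_t)(t1 - t0) / (double)n;
}
```

Run ticks_per_iteration once with dummy_kernel and once with dot_kernel on buffers of realistic size and placement. If the two numbers are already close, the loop is memory-bound and further tuning of the arithmetic will not buy you anything.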


Btw - regarding C vs. C++: They work equally well if you know what you're doing. The C++ support is adequate. Classes work and templates work. If you just use C++ features to organize your code into classes and don't use exception handling or RTTI, the code will perform as well as C code.

Cheers,
Nils