Re: Performance and Flash Pipelining on TI 28F12 DSPs
- From: Jack Klein <jackklein@xxxxxxxxxxx>
- Date: Tue, 27 Sep 2005 23:09:57 -0500
On Tue, 27 Sep 2005 14:28:47 -0400, Roberto Waltman
<usenet@xxxxxxxxxxxx> wrote in comp.dsp:
> We are developing an embedded controller based on a TI 28F12.
> This new product consists of a single board replacing an older system
> containing 8 (different) boards, each one running code on an Intel
> 80196 16-bit microcontroller.
>
> The software (painstakingly translated from '196 assembler into C &
> C++) is typical microcontroller stuff, as opposed to DSP stuff (no FIR
> filters, FFTs, PID loops, no need for vector/MAC instructions, etc.)
>
> In short, we are using the 2812 as a fast general purpose
> microcontroller, with plenty of on-chip memory and peripherals.
> The CPU was selected before I joined the project, based on the
> assumption it will be fast enough to do the job of the 8-gang 80196s.
>
> The question now is, how fast is the CPU really? We are about to start
> writing some benchmarking code, but we do not have enough of the code
> that will be running in the final application to get the full picture.
>
> The software structure does not allow us to follow the classical
> approach of running time critical loops from internal RAM instead of
> Flash.
> Due to the functions implemented and the fact that we are running the
> equivalent of 4 different older products (the 8 controllers we are
> replacing were not identical) there isn't a small well defined segment
> of "critical code" we could move to RAM.
>
> So, my question to the group: What kind of degradation from the
> theoretical figure of 150 MIPS can we expect running code that does
> not take advantage of the special DSP instructions and running only
> from internal flash?
>
> We will be accessing both onboard and outboard peripherals and need to
> service several periodic interrupt sources, the fastest being 500 (1
> source) and 125 (2 sources) microsecond rates.
>
> Also, how big an impact does the "Flash Pipelining" have? Without it the
> effective instruction execution rate is around 25 MHz, which puts us in
> the same ballpark as the aggregated performance of the original
> system, and probably will not meet our goals because the SW is now in
> C instead of assembler, and some new functionality was added.
>
> Does anybody have a reference point from a similar scenario?
>
> Thanks,
>
>
> Roberto Waltman
Hello Roberto,
We've got two motion control boards using the 2812 in our new product
line that began shipping a few months ago.
Both of our boards have a 128K x 16 fast SRAM mapped into the external
memory zone 7. This has two big advantages, one for the product and
one for development.
The advantage for the product is for firmware upgrades in the field.
Since the external RAM is as big as the internal flash, the firmware
receives the entire new flash image and stores it in RAM, and verifies
that the image is good, before it erases the current contents of the
flash and reprograms it.
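
A minimal sketch of that verify-before-erase sequence follows. It is an illustration only: the additive checksum, the function names, and the driver hooks are all hypothetical, and a real updater would more likely use a CRC and the vendor's flash programming routines.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical image check: sum of all 16-bit words, compared with an
   expected value delivered alongside the image. */
static uint16_t image_checksum(const uint16_t *image, size_t words)
{
    uint16_t sum = 0;
    size_t i;
    for (i = 0; i < words; ++i) {
        sum = (uint16_t)(sum + image[i]);
    }
    return sum;
}

/* Hypothetical flash driver hooks, supplied by the caller.
   Each returns 0 on success. */
typedef int (*flash_erase_fn)(void);
typedef int (*flash_program_fn)(const uint16_t *image, size_t words);

/* Returns  0 on success,
           -1 if the staged image fails the check (flash is left untouched),
           -2 if erasing or programming fails. */
int update_firmware(const uint16_t *staged, size_t words, uint16_t expected,
                    flash_erase_fn erase, flash_program_fn program)
{
    if (image_checksum(staged, words) != expected) {
        return -1;   /* reject before the old firmware is erased */
    }
    if (erase() != 0 || program(staged, words) != 0) {
        return -2;
    }
    return 0;
}
```

The point of the ordering is that a corrupted download is rejected while the old firmware is still intact in flash.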
The other advantage is for development/debugging with a JTAG debugger,
since we can load the test code directly into the external RAM and run
it from there without programming the flash.
The reason I mention the external RAM, which may be of no real
interest to you, is that it sheds some light on pipelined flash speed.
The read timing for external memory has a long setup time on data in
to the processor, so we end up running the external RAM cycles for a
total of 4 CPU clocks. And of course the flash requires 5 wait
states, 6 clock cycles total, which you already know.
You would think that external RAM would be faster than the flash, by a
factor of about 1.5, and it is when flash pipelining is turned off.
But with flash pipelining turned on, code running from flash is
noticeably faster than the same code running from RAM. We've never
made any specific attempts at timing it, because it is not important
for our application, but it appears to be at least twice as fast.
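
The unpipelined figures above follow from simple division. A small sketch, assuming the 150 MHz clock from the original question and ignoring any pipelining or prefetch (the function name is made up for the example):

```c
/* Effective access rate in MHz for a memory that takes
   `clocks_per_access` CPU clocks per access, with no pipelining. */
static double access_rate_mhz(double cpu_mhz, unsigned clocks_per_access)
{
    return cpu_mhz / (double)clocks_per_access;
}
```

With 6 clocks per flash access this gives 150 / 6 = 25 MHz, the figure quoted in the question, and with 4 clocks per external RAM access it gives 150 / 4 = 37.5 MHz, a ratio of 1.5.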
I first noticed this when writing the critical error handler, that
disables all interrupts and flashes an error code on the pair of
7-segment LEDs on board. It turns off the PLL when it runs. Since
the timers and all other interrupts are shut off, the timing of this
display is done by loops. I tested this running in external RAM and
adjusted the delay until I liked the rate at which the multiple digit
codes are held on the LEDs.
The first time I ran it out of flash, I was amazed at how much faster
it was. I actually had to go back and double the delay loop counts to
make sure that the digits stayed on long enough for a person to write
them down.
I've pasted a snippet of code below, but note that these are simple
delay loops that translate into short stretches of a few instructions
between jumps, and jumps presumably flush the flash pipeline.
Still, I would suggest making every effort to copy your highest
speed interrupt service routines into internal RAM, and placing any
static data variables they use in internal RAM as well. You can
define internal RAM segments and put code and data in them on a
function-by-function or file-by-file basis. It's pretty easy, there's
a TI app note that covers it pretty well.
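
A rough sketch of the usual pattern, assuming TI's C28x compiler and a linker command file that gives one code section separate load (flash) and run (RAM) addresses. The section name "ramfuncs", the helper copy_words, and the ISR name are illustrative assumptions; the app note and the project's linker file define the real ones.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy a block of 16-bit words, for example from the flash load address
   of a code section to its RAM run address. Call it once at startup,
   before enabling the interrupt that uses the RAM-resident code. */
void copy_words(const uint16_t *src, const uint16_t *src_end, uint16_t *dst)
{
    while (src < src_end) {
        *dst++ = *src++;
    }
}

#ifdef __TMS320C2000__
/* TI compiler only: place the routine in the (assumed) "ramfuncs" section. */
#pragma CODE_SECTION(fast_timer_isr, "ramfuncs")
#endif
/* Hypothetical ISR; on the target it would also carry TI's
   `interrupt` keyword. */
void fast_timer_isr(void)
{
    /* time-critical work goes here */
}
```

On the target, the source and destination pointers passed to copy_words would come from linker-generated symbols for the section's load start, load end, and run start.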
One of our boards, that drives up to four brush DC motors
simultaneously, has two interrupts per PWM cycle at a 20 kHz PWM.
That's one interrupt every 25 microseconds, and a complete velocity
control PID loop gets run periodically during one of those interrupts.
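
To put such interrupt rates in terms of CPU clocks, here is a small sketch. It assumes the 150 MHz figure from the original question; the board described here may run at a different clock, and the function name is made up for the example.

```c
#include <stdint.h>

/* CPU clocks available between successive interrupts. */
static uint32_t cycles_per_interrupt(uint32_t cpu_hz, uint32_t interrupt_hz)
{
    return cpu_hz / interrupt_hz;
}
```

Two interrupts per cycle at 20 kHz is 40,000 interrupts per second, so at 150 MHz there are 150,000,000 / 40,000 = 3750 clocks per interrupt period. The 125 microsecond sources in the question (8000 per second) would leave 18,750 clocks each under the same assumption.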
Here's the code snippet, watch out for word wrap:
for ( ; ; )
{
    CPLD_Regs.CPLD_SvnSeg.all = leds [0];
    for (ticker = 0; ticker < 200000; ++ticker) {
        GPIOTOGGLE_WDOG_PUNCH = 1;
    }
    CPLD_Regs.CPLD_SvnSeg.all = 0;
    for (ticker = 0; ticker < 50000; ++ticker) {
        GPIOTOGGLE_WDOG_PUNCH = 1;
    }
    CPLD_Regs.CPLD_SvnSeg.all = leds [1];
    for (ticker = 0; ticker < 200000; ++ticker) {
        GPIOTOGGLE_WDOG_PUNCH = 1;
    }
    CPLD_Regs.CPLD_SvnSeg.all = 0;
    for (ticker = 0; ticker < 50000; ++ticker) {
        GPIOTOGGLE_WDOG_PUNCH = 1;
    }
    CPLD_Regs.CPLD_SvnSeg.all = leds [2];
    for (ticker = 0; ticker < 200000; ++ticker) {
        GPIOTOGGLE_WDOG_PUNCH = 1;
    }
    CPLD_Regs.CPLD_SvnSeg.all = 0;
    for (ticker = 0; ticker < 50000; ++ticker) {
        GPIOTOGGLE_WDOG_PUNCH = 1;
    }
    CPLD_Regs.CPLD_SvnSeg.all = leds [3];
    for (ticker = 0; ticker < 200000; ++ticker) {
        GPIOTOGGLE_WDOG_PUNCH = 1;
    }
    CPLD_Regs.CPLD_SvnSeg.all = 0;
    for (ticker = 0; ticker < 50000; ++ticker) {
        GPIOTOGGLE_WDOG_PUNCH = 1;
    }
    CPLD_Regs.CPLD_SvnSeg.all = leds [4];
    for (ticker = 0; ticker < 200000; ++ticker) {
        GPIOTOGGLE_WDOG_PUNCH = 1;
    }
    CPLD_Regs.CPLD_SvnSeg.all = 0;
    for (ticker = 0; ticker < 50000; ++ticker) {
        GPIOTOGGLE_WDOG_PUNCH = 1;
    }
    CPLD_Regs.CPLD_SvnSeg.all = leds [5];
    for (ticker = 0; ticker < 200000; ++ticker) {
        GPIOTOGGLE_WDOG_PUNCH = 1;
    }
    CPLD_Regs.CPLD_SvnSeg.all = 0;
    for (ticker = 0; ticker < 400000; ++ticker) {
        GPIOTOGGLE_WDOG_PUNCH = 1;
    }
    /* slow down to minimal speed after the first time */
    EALLOW;
    SysCtrlRegs.PLLCR = 0; /* 1/2 Xtal Speed = 14.000 MHz */
    EDIS;
}
You can see that the loops aren't optimized for pipelining at all, but
they still run much faster from flash than from external RAM.
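
For readers who want the same display sequence in a more compact form, here is a table-driven sketch. The two volatile variables are hypothetical stand-ins for the CPLD and watchdog registers in the snippet above, and the function names are invented. The function call and the volatile counter change the per-iteration timing, so the counts would need re-tuning on real hardware.

```c
#include <stdint.h>

/* Stand-ins for the memory-mapped registers in the original snippet. */
static volatile uint16_t seven_seg;    /* hypothetical: CPLD_Regs.CPLD_SvnSeg.all */
static volatile uint16_t wdog_toggle;  /* hypothetical: GPIOTOGGLE_WDOG_PUNCH */

/* Busy-wait for `count` iterations, punching the watchdog each time. */
static void delay_punching_watchdog(uint32_t count)
{
    volatile uint32_t ticker;
    for (ticker = 0; ticker < count; ++ticker) {
        wdog_toggle = 1;
    }
}

/* Show each of `n` codes for `on` ticks, blank the display for `gap`
   ticks between codes, and for `pause` ticks after the last one. */
static void show_codes(const uint16_t *leds, unsigned n,
                       uint32_t on, uint32_t gap, uint32_t pause)
{
    unsigned i;
    for (i = 0; i < n; ++i) {
        seven_seg = leds[i];
        delay_punching_watchdog(on);
        seven_seg = 0;
        delay_punching_watchdog(i + 1 < n ? gap : pause);
    }
}
```

Calling show_codes(leds, 6, 200000, 50000, 400000) inside the endless loop would reproduce the sequence of the original snippet.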
--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://www.eskimo.com/~scs/C-faq/top.html
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.contrib.andrew.cmu.edu/~ajo/docs/FAQ-acllc.html