Re: Recode to Play MP3?
- From: "Michael J. Mahon" <mjmahon@xxxxxxx>
- Date: Mon, 28 May 2007 00:26:05 -0700
mdj wrote:
On May 27, 7:59 pm, "Michael J. Mahon" <mjma...@xxxxxxx> wrote:
That's certainly true :-) Ok, so I devised a small 'memory bandwidth'
test which simply copies then restores the zero page to an arbitrary
memory page 256 times, multiplied by the value in memory location $08.
The destination page is set in locations $06-$07.
With the iteration counter in $08 set at $10, it's moving a megabyte
of data to the specified location and back again.
So: 2*256*256 = 128KB of data,
128*16 = 2MB total moved.
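For anyone who wants to check the byte count, here is a minimal sketch in Python. The constant and function names are made up for illustration; the values come straight from the test description (a 256-byte page, copied out and restored, 256 times per pass, with $08 holding $10).

```python
PAGE_BYTES = 256          # one 6502 memory page
COPIES_PER_PASS = 256     # repeat count per pass, from the test description
DIRECTIONS = 2            # copy out, then restore
PASSES = 0x10             # value stored in location $08

def bytes_moved(passes=PASSES):
    """Total bytes transferred by the copy-and-restore test."""
    return DIRECTIONS * PAGE_BYTES * COPIES_PER_PASS * passes

print(bytes_moved(1) // 1024, "KB per pass")
print(bytes_moved() // (1024 * 1024), "MB total")
```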
Unaccelerated Apple II: 34 sec
Transwarp Apple II to location $6000: 10.1sec
Transwarp Apple II to location $2000: 11.2sec
So the Transwarp shows a 9.8% improvement in this test when not
writing to a 'write-through' memory location.
It would be interesting to see how the Zip Chip fares in this test at
both 8MHz and 4MHz, and if its programmable speed can get it
reasonably close to 3.6MHz (which the Transwarp runs at), that would be
quite interesting to see too.
I verified your 34 sec unaccelerated number, and tried the
Zip Chip at 8MHz. Its time is 4.3sec, independent of address.
Well now, that's a lot faster than I was expecting - I was
anticipating somewhere around 7sec ...
I tried to get a 4MHz number, but couldn't get ZIP.SYSTEM to
set the speed for some reason (I've never used it at other than
8MHz and disabled).
As for getting closer to 3.58MHz, the closest multiple offered
(that I couldn't get it set to ;-) was 3.333MHz, so that comparison
won't be very enlightening.
It would be good to get a point at 4MHz to see where the intercept is,
but, as I said, I can't seem to get it set to any speed but "full on".
A straight linear extrapolation to 3.58MHz would be 9.6sec.
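The extrapolation is just a clock-ratio scaling. A rough sketch, assuming run time is inversely proportional to clock speed (which is exactly the assumption being tested here); the function name is hypothetical:

```python
def extrapolate_time(measured_sec, measured_mhz, target_mhz):
    """Scale a run time, assuming it is inversely proportional to clock speed."""
    return measured_sec * measured_mhz / target_mhz

zip_8mhz_sec = 4.3   # measured Zip Chip time at 8MHz
estimate = extrapolate_time(zip_8mhz_sec, 8.0, 3.58)
print(round(estimate, 1), "sec at 3.58MHz, if scaling were perfectly linear")
```

This can be set against the 10.1sec Transwarp figure quoted earlier.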
Well, this test was 'designed' to focus on writeback speed, and it
would seem that the loops have to be a lot tighter to cause the Zip to
stall at 8MHz, so I agree that a linear extrapolation is quite
accurate. Of course, it may be different if there's cache-line
contention going on, like in a large block copy, but I
deliberately left that out to focus on writes. An alternative test
might do the same thing by moving 16KB of data back and forth between
AUX and MAIN memory using AUXMOVE (which is quite slow too, since it
uses the stack to 'bounce' the data between banks).
Any linear move greater than 8KB will trash the Zip Chip's 8KB cache,
and every load will be a miss.
Your code is "perfect" for the cache, since the first page move "warms
up" the cache, and after that, no cache misses occur. The data pages
are not aligned (mod 8K) with the code, so no thrashing ever occurs
either.
Each byte of the code and each byte of the data page will be faulted
into the cache exactly once, and after that, everything runs from
cache. Since the write buffer can handle one byte every 8 cycles
(at 8MHz), and the loops only move 1 byte every 15-16 cycles, there
is never any waiting for memory after the cache is initially warmed.
The bottom line is that the Zip write buffer is very effective in
reducing write stalls, even at 8MHz, since code would have to store
more than one byte per 8 cycles (at 8MHz) to create a stall. This
is not common in 6502 code, particularly loops.
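Here is a toy model of that reasoning in Python. It assumes a one-byte write buffer that takes 8 fast cycles to drain, and it ignores any synchronization with the phase of the 1MHz clock, so treat it as a sketch of the argument rather than a description of the real hardware. The function name and parameters are invented for illustration.

```python
def stall_cycles(store_interval, drain_cycles=8, stores=1000):
    """Count fast-clock cycles lost waiting on a one-byte write buffer.

    store_interval: fast cycles of work between successive stores.
    drain_cycles:   fast cycles needed to empty the buffer into 1MHz RAM
                    (8 at 8MHz, per the figures above).
    """
    now = 0              # current time in fast cycles
    buffer_free_at = 0   # time at which the buffer is next empty
    stalled = 0
    for _ in range(stores):
        now += store_interval
        if now < buffer_free_at:            # buffer still busy: CPU waits
            stalled += buffer_free_at - now
            now = buffer_free_at
        buffer_free_at = now + drain_cycles
    return stalled

for interval in (16, 15, 8, 5):
    print(interval, "cycles per store ->", stall_cycles(interval), "stall cycles")
```

In this model a loop storing one byte every 15 or 16 cycles never waits, while a hypothetical loop storing every 5 cycles would lose 3 cycles on each store after the first.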
Since you move the data back and forth, the page last written is
the page next read--perfect. Moving from page 0 to a page 0 mod 8K
away conflicts perfectly, so if data were *repeatedly* moved from
page 0, but not back and forth, every load would be a cache miss!
Doing that experiment, with the second copy loop changed to re-copy
page 0 to $20xx, results in a time of 8.1sec--almost twice as long.
In this case, the cache only benefits code fetching, not data access.
But since most 6502 activity is code fetching, it's still pretty fast!
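The conflict behavior can be reproduced with a small Python simulation. It models data accesses only and assumes an 8KB direct-mapped cache with byte-sized entries in which a store also updates the cache entry; those details are inferred from the description above, not taken from Zip Chip documentation, and all names are illustrative.

```python
CACHE_BYTES = 8 * 1024   # Zip Chip cache size as stated above

def count_load_misses(copies, passes=4, page_len=256):
    """Simulate data accesses in a direct-mapped cache (assumed model).

    copies: list of (source_base, dest_base) page copies done in each pass.
    Returns (load_misses, total_loads).
    """
    tags = {}                     # cache index -> address currently held
    misses = loads = 0
    for _ in range(passes):
        for src, dst in copies:
            for offset in range(page_len):
                # Load from the source byte.
                addr = src + offset
                loads += 1
                if tags.get(addr % CACHE_BYTES) != addr:
                    misses += 1
                tags[addr % CACHE_BYTES] = addr   # fill on load
                # Store to the destination byte (assumed to update the cache).
                addr = dst + offset
                tags[addr % CACHE_BYTES] = addr

    return misses, loads

back_and_forth = [(0x0000, 0x2000), (0x2000, 0x0000)]   # original test
recopy = [(0x0000, 0x2000), (0x0000, 0x2000)]           # modified test

print("back and forth:", count_load_misses(back_and_forth))
print("re-copy page 0:", count_load_misses(recopy))
```

Under these assumptions the back-and-forth version misses only on the first 256 loads, while the re-copy version misses on every load, which matches the explanation for the longer run time.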
But, of the 4.3sec time at 8MHz, 2.06sec is spent doing 2MB of stores
at 1MHz, which is independent of Zip clock speed (since it's always
write-through). That leaves 2.24sec of accelerated time at 8MHz.
If this accelerated part were run at 1MHz, it would take 8 times longer,
or 17.92sec, which when added to 2.06sec gives a total of only 19.98sec,
well short of the 34sec actually measured at 1MHz.
Therefore, the 2MB of writes, even though each takes a full 1MHz cycle to
process, are not interfering substantially with the high speed execution
of the Zip Chip. It undoubtedly has a 1-byte "write buffer" that allows
it to keep running unless another byte is written before the buffer is
emptied into Apple RAM.
Since only half the store bandwidth is used by these loops, it never
stalls for writes, and the linear extrapolation is closer to the truth.
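The arithmetic behind that conclusion, as a short Python sketch. The inputs are the figures quoted above (4.3sec at 8MHz, 2.06sec of 1MHz store time, 34sec unaccelerated); the function name is made up for illustration.

```python
def serialized_1mhz_estimate(total_fast_sec, store_sec, speedup):
    """Predicted 1MHz run time if stores never overlapped with execution."""
    accelerated_sec = total_fast_sec - store_sec   # time that scales with clock
    return accelerated_sec * speedup + store_sec

estimate = serialized_1mhz_estimate(4.3, 2.06, 8)
measured_1mhz_sec = 34
print(round(estimate, 2), "sec predicted vs", measured_1mhz_sec, "sec measured")
```

Because the prediction falls so far below the measured time, the starting assumption (that the stores add their full time on top of the accelerated execution) cannot be right, so the stores must be overlapping with execution.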
Agreed. Well this is interesting, as it shows that the write-through
mechanism of the Zip is not only asynchronous, but buffered too,
perhaps only by 1 byte, but still works quite well.
The numbers from the Transwarp indicate that its synchronous design
running to 3.6MHz RAM doesn't scale as well as the Zip's cache design
does, which I didn't expect. If the linear extrapolation you mentioned
holds true at 3.6MHz, it would seem the Transwarp is actually slightly
underperforming, either due to DRAM refresh getting in the way, or
DRAM speed limitations. Or it could simply be that the Zip's 'lazy'
write mechanism from cache->memory is a big performance win.
Most interesting of all, it indicates that 8MHz is not the ceiling
speed for the Zip acceleration architecture either. It would appear
even 16MHz would work quite nicely under this test.
Yes, that surprised me, too. I had not appreciated the write buffering
of the Zip, so my past estimates of Zip code execution were pessimistic.
Of course as mentioned, the Zip never has to reload a cache line under
this test, so perhaps bus contention of loading the cache *and*
writing would expose a limitation, but that requires a different
test...
Right. Or moving the data block to $x300 for even x... ;-) Then there
will be 8/256ths of each loop that runs slower as the direct-mapped
cache is thrashed by the data/code interference.
That runs in 4.7sec--0.4sec slower.
On the average, caches work. Larger caches work better (as long as they
remain fast). Caches with wider associativity work better (as long as
they remain fast).
Most processor time is spent in relatively tight loops, with very modest
code working set sizes, and with data working set sizes varying from
small to all of memory.
Using a small, fast memory plus control to make a large, slow memory
appear fast is a great trick that works "most of the time". ;-)
-michael
NadaNet file server for Apple II computers!
Home page: http://members.aol.com/MJMahon/
"The wastebasket is our most important design
tool--and it's seriously underused."