Re: CPU benchmark for Xilinx PAR
- From: "JJ" <johnjakson@xxxxxxxxx>
- Date: 13 Sep 2005 01:20:41 -0700
Very interesting.
I really doubt it's the branch behaviour, even though the Athlon series
has always been good on office-type twisty apps. For branchy code
segments that fit in the I-cache, branches these days come almost
for free and are predicted right more often than not.
I'd hazard a guess it has more to do with the data set being very large
and missing the L1, L2 and TLBs way too often: "poor locality of
reference". Even 1% misses, maybe less, may be enough to wreak havoc.
It's not difficult to create a simple data structure that holds millions
of items in a hash table and see even an Athlon xp2400 average 300ns
per access to each entry if all accesses appear random, rather than
the naive 1ns its L1 cache can actually do.
You can plot a graph of random address width from 6 bits to 24 bits
and watch the access time for x[i] go from 1ns to 4ns and then step
roughly through 30ns, 100ns, 300ns, when i comes from any old random
number generator and is masked by the width field. Measured on an xp2400.
If this simple test were run on various cpus, we could see how the
caching really works for progressively worse locality cases and choose
accordingly.
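For anyone who wants to try it, here is a minimal sketch of that kind of
sweep in C. It is not the code behind the numbers above, just one way to
set it up. The names (xorshift32, width_mask, time_random_access,
run_sweep) are made up for illustration, and xorshift32 is only standing
in for "any old random number generator". The timing uses clock(), so it
is coarse, and it includes the cost of the generator itself.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for "any old random number generator".
   The state must be nonzero, otherwise it stays at zero forever. */
uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Mask keeping the low `width` bits of a random index (width 1..31). */
uint32_t width_mask(unsigned width)
{
    assert(width >= 1 && width <= 31);
    return (UINT32_C(1) << width) - 1;
}

/* Average ns per x[i] access when i is a random number masked to
   `width` bits. x must hold at least 2^width entries. */
double time_random_access(const uint32_t *x, unsigned width, size_t accesses)
{
    uint32_t mask = width_mask(width);
    uint32_t rng = 2463534242u;      /* arbitrary nonzero seed */
    uint32_t sum = 0;
    volatile uint32_t sink;          /* keeps the loop from being optimised away */

    clock_t t0 = clock();
    for (size_t n = 0; n < accesses; n++)
        sum += x[xorshift32(&rng) & mask];
    clock_t t1 = clock();

    sink = sum;
    (void)sink;
    return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)accesses;
}

/* Print one line per address width; returns 0 on success, -1 if the
   table could not be allocated. */
int run_sweep(unsigned min_width, unsigned max_width, size_t accesses)
{
    size_t n = (size_t)1 << max_width;
    uint32_t *x = malloc(n * sizeof *x);
    if (x == NULL)
        return -1;
    for (size_t i = 0; i < n; i++)   /* touch every page before timing */
        x[i] = (uint32_t)i;

    for (unsigned w = min_width; w <= max_width; w++)
        printf("%2u bits: %.2f ns/access\n",
               w, time_random_access(x, w, accesses));
    free(x);
    return 0;
}
```

Calling run_sweep(6, 24, 50000000) from main() covers the 6 to 24 bit
range described above (64 MB table with 4-byte entries). Note that the
loads here are independent of each other, so an out-of-order cpu can
overlap several misses, and the figures will not be pure memory latency.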
Now EDA software doesn't deliberately do this, but it might get some of
the same effect unintentionally, simply by having to walk immense graphs
and trees. Think about it: draw a graph with millions of nodes and try to
label it in such a way that it can be traversed with mostly low address
bit changes (high locality) when the nodes in the graph are allocated
in completely random fashion. Then think how many operations actually
get performed on each linked-list traversal. A lot of the time it might
be just passing through looking for something: the worst possible
situation, all fetch and no work.
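The "all fetch, no work" case can be sketched in C as a linked list
walked once with the nodes linked in address order and once linked in
shuffled order. This is only an illustration under my own assumptions,
not anything taken from an EDA tool: the node layout and the names
(link_in_order, shuffle_order, walk, compare_layouts) are invented here,
and the shuffle uses a throwaway xorshift generator.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct node {
    struct node *next;
    uint32_t payload;               /* never read: the walk does no work */
};

/* Chain nodes[order[0]] -> nodes[order[1]] -> ... and return the head.
   n must be at least 1. */
struct node *link_in_order(struct node *nodes, const size_t *order, size_t n)
{
    for (size_t k = 0; k + 1 < n; k++)
        nodes[order[k]].next = &nodes[order[k + 1]];
    nodes[order[n - 1]].next = NULL;
    return &nodes[order[0]];
}

/* Fisher-Yates shuffle; seed must be nonzero. The slight modulo bias
   does not matter for this purpose. */
void shuffle_order(size_t *order, size_t n, uint32_t seed)
{
    uint32_t s = seed;
    if (n < 2)
        return;
    for (size_t i = n - 1; i > 0; i--) {
        s ^= s << 13;
        s ^= s >> 17;
        s ^= s << 5;
        size_t j = s % (i + 1);
        size_t tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }
}

/* Pass through the whole list doing nothing but following pointers. */
size_t walk(const struct node *head)
{
    size_t count = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        count++;
    return count;
}

/* Time the same walk over address-ordered and shuffled links.
   Returns 0 on success, -1 on bad input or allocation failure. */
int compare_layouts(size_t n)
{
    if (n == 0)
        return -1;
    struct node *nodes = malloc(n * sizeof *nodes);
    size_t *order = malloc(n * sizeof *order);
    if (nodes == NULL || order == NULL) {
        free(nodes);
        free(order);
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        nodes[i].payload = (uint32_t)i;
        order[i] = i;
    }
    for (int pass = 0; pass < 2; pass++) {
        if (pass == 1)
            shuffle_order(order, n, 12345u);
        struct node *head = link_in_order(nodes, order, n);
        clock_t t0 = clock();
        size_t visited = walk(head);
        clock_t t1 = clock();
        printf("%-10s %zu nodes, %.2f ns/node\n",
               pass ? "shuffled:" : "in order:", visited,
               (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)n);
    }
    free(nodes);
    free(order);
    return 0;
}
```

Unlike an independent x[i] lookup, each load here depends on the one
before it, so the cpu cannot overlap the misses. Something like
compare_layouts(1u << 24) from main() should be big enough to fall out
of the caches on most machines, though that size is a guess.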
I don't imagine there is much EDA code that looks like beautiful DSP
media codec stuff with super straight line high locality SSE tuned
code.
I could be all wrong, but I think it's the Memory Wall effect, and the
Opteron maybe does a better job of recovering. That also means a cpu
that concentrates on that aspect doesn't even need a clock advantage, as
long as it tolerates poor locality better.
I wonder if it's possible to get stats from the cpu performance hardware
that show what the cpu is really doing in memory; that's a bit out of my
league.
I wonder if the EDA guys just crank out code, or do they ever measure
algorithms on different x86 hardware at the cache level?
I also wonder how much the FPU is actually used, and how.
On a threaded cpu designed to work with threaded memory, where there is
little memory wall (latency tolerance all around), it doesn't take much
hardware to design a processor element in FPGA that can match an Athlon
xp300, and 10 or so ganged together can then match an xp3000, but you get
40-odd threads to fill instead of waiting on cache misses. Me, I'd
rather fill the threads (occam style) than wait, but most are not of
that opinion (yet).
Now if EDA ever becomes highly concurrent (some have done this in VLSI
EDA, from simulation to P/R), it does make some real speedups possible
when real threading becomes pervasive in cpus (not this 2- or 4-thread
nonsense).
johnjakson at usa dot ...
transputer2 at yahoo dot ...
- References:
- CPU benchmark for Xilinx PAR
- From: Paul Gentieu