Re: Best FPGA for floating point performance
- From: "JJ" <johnjakson@xxxxxxxxx>
- Date: 29 Aug 2005 06:50:10 -0700
mk wrote:
> On 28 Aug 2005 16:49:15 -0700, "JJ" <johnjakson@xxxxxxxxx> wrote:
> >
> the problem with highly threaded cpus is that they are not very good
> at running wordprocessors, spreadsheets and fpga p&r tools and that's
> where most cpus are used so the two cpu developers put most of their
As long as almost all favorite software languages don't support
concurrency, or even worse have a broken early-1960s lock-based model
(Java, C#), then it's a darn good job most apps are not multithreaded;
they would all be broken, as Java Swing was shown to be. But with
languages like CSP/occam, which can be added to C++ or Java, it's much
more practical to do so, but there hasn't been any compelling reason to
do this SO FAR. I hear that C++ may finally be getting some official
concurrency added in the coming round. Ada though has been on solid
ground for 20yrs and coincidentally shares some of occam's view of
cooperating processes with its rendezvous.
For office processing, it really doesn't take more than a few Mips to
do the raw grunt work; it's the GUIs and bloatware that are killing most
SW performance today, as well as the CPUs' memory systems that don't
like anything but extreme locality of reference. Most applications like
Word, OpenOffice are really databases with complex data structures that
cannot fit into any cache. They are not so different in principle from
EDA databases: lots of hash tables, trees, lists, graphs, and all of
these have poor locality of reference. Hash tables are the worst; they
have completely random behaviour and are my favourite data structure
(being associative). On paper a 20-instruction (10ns) hash entry
actually takes 1000 cycles every time (if the table is > cache).
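To put a rough number on the locality point, here is a minimal sketch of a toy direct-mapped cache, comparing a sequential array walk with hash-style random probes into a table far bigger than the cache. Everything in it is an assumption for illustration: the 16K size is borrowed from the L1 figure later in this post, while the 32-byte line, the 64MB table, and the names hit_rate, sequential and hashed are made up. It only counts hits and misses; it does not model timing.

```python
import random

def hit_rate(addresses, cache_lines=512, line_bytes=32):
    """Fraction of accesses that hit a toy direct-mapped cache (512 x 32B = 16K)."""
    tags = [None] * cache_lines          # which memory line each slot currently holds
    hits = 0
    for addr in addresses:
        line = addr // line_bytes        # memory line this address falls in
        slot = line % cache_lines        # direct-mapped: one possible slot per line
        if tags[slot] == line:
            hits += 1
        else:
            tags[slot] = line            # miss: fetch the line, evicting the old one
    return hits / len(addresses)

N = 100_000
TABLE_BYTES = 64 * 1024 * 1024           # hypothetical table, far larger than the cache
rng = random.Random(0)                   # fixed seed so runs are repeatable

sequential = [4 * i for i in range(N)]                         # walk 4-byte words in order
hashed = [rng.randrange(0, TABLE_BYTES, 4) for _ in range(N)]  # hash-like random probes

seq_rate = hit_rate(sequential)
hash_rate = hit_rate(hashed)
print(f"sequential walk hit rate: {seq_rate:.3f}")
print(f"hashed probes hit rate:   {hash_rate:.4f}")
```

The sequential walk misses once per 32-byte line and hits on the other seven words; the hashed probes almost never find their line already resident, which is the "every lookup goes to DRAM" behaviour described above.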
I think that folks who use SW but don't write it are living under the
illusion spread by Intel/AMD marketing that all their opcodes run at 2
or 3GHz and that with the wonders of superscalar and out-of-order
execution, they run several opcodes per cycle. In practice they hit the
memory wall all the time. When locality is present, they are amazing
though, and multimedia codecs are esp good at this.
Programs that manage large databases, esp those for VLSI and also FPGA
P/R, have no locality at all. If the database can't fit into L2 cache,
the cpu is broken. I know of no suitable CS data structures that are
extremely cache friendly that can substitute for those above and
described by Knuth. Most CS data structures were imagined (in the 60s
or earlier) when memory cycles were similar to processor cycles.
The memory model I propose (at cpa2005) not only has a fairly flat
access time across its entire DRAM space, it issues about as fast as
your typical L2 SRAM, but has some latency and multithreading as its
price.
Would you rather have a 1-thread cpu that lives in a cache prison of 1ns
at 16K, plus a 4ns 512K backyard, plus a >100ns nGB wasteland with
always too few TLBs and missing more often than not? An N% miss rate
means 1ns + N*100ns/100 avg per access. More than 1% misses means
several ns avg across L1, L2, DRAM accesses. A 3% miss rate probably
means 250Mips of loads and stores, plus a few times more for all the
free opcodes in between that don't hit memory.
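The arithmetic above can be checked with a short sketch. It uses only the 1ns hit time and the 100ns miss cost from the paragraph and, as a simplifying assumption, ignores the 4ns L2 tier and any TLB cost; the function names avg_access_ns and mem_mops are made up for illustration.

```python
def avg_access_ns(miss_percent, hit_ns=1.0, miss_penalty_ns=100.0):
    """Average time per load/store: hit time plus the weighted miss penalty."""
    return hit_ns + (miss_percent / 100.0) * miss_penalty_ns

def mem_mops(miss_percent, hit_ns=1.0, miss_penalty_ns=100.0):
    """Millions of loads/stores per second if each one costs the average time."""
    return 1000.0 / avg_access_ns(miss_percent, hit_ns, miss_penalty_ns)

for pct in (0, 1, 3, 10):
    print(f"{pct:>2}% misses: {avg_access_ns(pct):5.1f} ns avg, "
          f"{mem_mops(pct):6.1f} M loads/stores per second")
```

At 3% misses the average access comes out at about 4ns, which is where the 250Mips of loads and stores figure comes from.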
Or would you rather have 4N threads, all of which see something like a
2-4 instruction DRAM access, even if threads are each much slower? In a
90nm ASIC process the FPGA design I describe could run each thread at
peak 200-250Mips. If 1 thread isn't enough to run OpenOffice, then
it's time to find another package.
The real limit to computing is memory throughput, period. Single
threaded cpus make it worse by mostly serializing accesses through a
single memory management unit. Multithreaded cpus with the same memory
model make it even worse because the locality is divided further.
But threaded processing with threaded memory can achieve far higher
sustained memory bandwidth, and that's really all that matters. After
that it's pretty easy to attach Mips to memory issues (but there's more
to it than that).
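A back-of-envelope way to see why more requests in flight buys sustained bandwidth is Little's law: throughput equals requests in flight divided by latency, capped by whatever the memory can issue at peak. The sketch below is only that generic relation, not a model of the design discussed here; the function name and every number in it (100ns latency, 400 issues per microsecond peak) are hypothetical.

```python
def sustained_accesses_per_us(outstanding, latency_ns, peak_per_us):
    """Little's law: throughput = requests in flight / latency, capped at peak issue rate."""
    return min(peak_per_us, outstanding * 1000.0 / latency_ns)

LATENCY_NS = 100.0     # hypothetical DRAM access latency
PEAK_PER_US = 400.0    # hypothetical peak issue rate of the memory system

for in_flight in (1, 4, 16, 64):
    rate = sustained_accesses_per_us(in_flight, LATENCY_NS, PEAK_PER_US)
    print(f"{in_flight:>2} accesses in flight: {rate:6.1f} accesses per microsecond")
```

With one access in flight the latency sets the rate; with enough independent threads keeping accesses in flight, the issue rate of the memory becomes the limit instead.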
> money into developing slightly threaded architectures with full
> multi-cores instead of smt. if you noticed the new multi-core i86
> implementations don't support ht anymore. another reason this idea is
> not very easy to implement is that regardless what's happening with
> power on 90nm and lower processes, speed still counts and embedding
> dram on a high speed logic process is a big problem. if and when a new
> memory structure comes out which can be embedded in logic process and
> as inexpensive as 1t+1c dram, i am sure isa architects will look at
> highly threaded cpus again but probably not before then. also keep in
> mind that developing cpus is a very expensive endevour and anyone who
> is not developing an x86 compatible one seems to be giving up
Well, I don't actually suggest any DRAM of any sort should be on die
with the cpu. What I do suggest is an RLDRAM interface to off the shelf
parts. Further, I suggest smaller and larger models of the same threaded
memory architecture: the smaller one in interleaved SRAM inside the cpu
at full cpu speed (i.e. an L1 cache again) and the larger one in much
slower DDR DRAM for lower cost. In effect a 3 level memory, all threaded
to give a high level of associativity at different speed/size points. No
paging, no TLBs. Page size is only 32 bytes though.
> including intel. i expect sun will drop sparc pretty soon and intel
> will drop itanium too; moving their itanium developers to xeon
> projects doesn't bode too well which is for the better as we won't
> have to deal with a completely proprietary, fully moated with patents
> isa.
I don't think Sun will be dropping Sparc at all, except the older
models. Sparc in its Niagara form will serve their needs better than any
x86, and Opterons for their other customers. Niagara and RMI MIPS are
very similar to what I propose, but they don't have the same memory
model I suggest.
Itaniums, I guess 50-50, don't really care. The world of comp arch is
not as sterile as an Intel only desktop world would have us believe.
The embedded space and the much smaller HPC (thank heavens) are entirely
more appreciative of engineering. I for one don't believe x86 is as
important as most believe. When you have been around 30yrs in the
business, everything is old and tired, and Windows is looking very
tired. 99% of x86s get used solely for surfing and light office work,
and almost all of these are idle 98% of the time. If you take a WinCE
toy and add a bit more RISC grunt and video output to it, what you
still have is the familiar Windows but not on x86. Eventually people
will tire of 100W heaters.
The workstation model died because it tried to hitch a free ride on x86
coat tails. If you want to use a PC to do FPGA P/R that is barely good
enough to surf the web or run bloatware, that's something we did to
ourselves. I do believe that ASIC & FPGA EDA could benefit enormously
from threading; it's been done in ASIC EDA for years in some products.
end of rant
johnjakson at usa ...
transputer2 at yahoo ...