Re: FPGA-based hardware accelerator for PC




Phil Tomson wrote:
In article <1146975146.177800.163180@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>,
JJ <johnjakson@xxxxxxxxx> wrote:


snipping


FPGAs and standard cpus are a bit like oil & water, they don't mix very
well: one is very parallel, the other very sequential.

Actually, that's what could make it the perfect marriage.

General purpose CPUs for the things they're good at like data IO,
displaying information, etc. FPGAs for applications where parallelism is
key.


On c.a another Transputer fellow suggested the term "impedance
mismatch" to describe the idea of mixing low speed extreme parallel
logic with high speed sequential cpus in regard to the Cray systems
that have a bunch of Virtex Pro parts with Opterons on the same board,
a rich man's version of DRC (but long before DRC). I suggest tweening
them: put lots of softcore Transputer-like nodes into the FPGA and
customize them locally, so you can put software & hardware much closer to
each other. One can even model the whole thing in a common language
designed to run as code or be synthesized as hardware with suitable
partitioning, starting perhaps with occam or Verilog+C. Write as
parallel and sequential code and later move parts around to hardware or
software as needs change.

I think the big problem right now is conceptual: we've been living in a
serial, Von Neumann world for so long we don't know how to make effective
use of parallelism in writing code - we have a hard time picturing it.

I think the software guys have a huge problem with parallel, but not
the old schematic guys. I have more problems with serial, much of it
unnecessary but forced on us by a lack of language features, which forces
me to order statements that the OoO cpu will then try to unorder. Why
not let the language state "no order" or just plain "par" with no
communication between?
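
To make the "par" idea concrete, here is a minimal Python sketch. The par() helper is hypothetical (it is not occam and not a standard library function); it only shows the source stating that two statements have no order between them and no communication. Python threads will not speed up CPU-bound work, so treat this as notation, not as a performance claim.

```python
from concurrent.futures import ThreadPoolExecutor

def par(*thunks):
    """Run independent zero-argument callables with no ordering between them.

    Hypothetical helper, loosely modelled on occam's PAR: it returns only
    after every branch has finished. Results come back in argument order.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(thunks))) as pool:
        futures = [pool.submit(t) for t in thunks]
        return [f.result() for f in futures]

def sum_squares(n):
    return sum(i * i for i in range(n))

def count_evens(n):
    return sum(1 for i in range(n) if i % 2 == 0)

# The two branches share no state, so the program never has to say
# which one runs first.
results = par(lambda: sum_squares(10), lambda: count_evens(10))
```

The only ordering the programmer commits to is the join at the end of par(); everything inside is left to the runtime.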

Read some software engineering blogs:
with the advent of things like multi-core processors, the Cell, etc. (and
most of them are blissfully unaware of the existence of FPGAs) they're
starting to wonder about how they are going to be able to model their
problems to take advantage of that kind of parallelism. They're looking

The problem with the Cell and other multicore cpus is that the cpu is
all messed up to start with. AFAIK the Transputer is the only credible
architecture that considers how to describe parallel processes and run
them based on formal techniques. These serial multi cpus have the
Memory Wall problem as well as no real support for concurrency except
at a very crude level; a context switch needs to cost closer to 100
instruction cycles to work well, not 1M. The Memory Wall only makes
threading much worse than it already was and adds more pressure to the
cache design as more thread contexts try to share it.

for new abstractions (remember, software engineering [and even hardware
engineering these days] is all about creating and managing abstractions).
They're looking for and creating new languages (Erlang is often mentioned
in these sorts of conversations). Funny thing is that it's the hardware
engineers who hold part of the key: HDLs are very good at modelling
parallelism and dataflow. Of course HDLs as they are now would be pretty
crappy for building software, but it's pretty easy to see that some of the
ideas inherent in HDLs could be usefully borrowed by software engineers.



Yeh, try taking your parallel expertise to the parallel
software world, they seem to scorn the idea that hardware guys might
actually know more than they do about concurrency while they happily
reinvent parallel languages that have some features we have had for
decades, all the while still clinging to semaphores and spinlocks. I came across
one such parallel language from U T Austin that even had always,
initial and assign constructs but no mention of Verilog or hardware
HDLs.

But there are more serious researchers in Europe who are quite
comfortable with concurrency as parallel processes like hardware, from
the Transputer days based on CSP, see wotug.org. The Transputer's
native language, occam, based on CSP, later got used to do FPGA design
and was then modified into Handel-C, so clearly some people are happy to be in
the middle.

I have proposed taking a C++ subset and adding live signal ports to a
class definition as well as always, assign etc. It starts to look a lot
like a Verilog subset in C syntax, but it builds processes as
communicating objects (or module instances) which are nestable of
course, just like hardware. The runtime for it would look just like a
simulator with an event driven time wheel or scheduler. Of course in a
modern Transputer the event wheel or process scheduler is in the
hardware so it runs such a language quite naturally, well that's the
plan. Looking like Verilog means RTL type code could be "cleaned" and
synthesized with off the shelf tools rather than having to build that
as well, and the language could be open. SystemVerilog is going in the
opposite direction.
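
For readers who have not seen the inside of a simulator, the event driven runtime described above can be sketched roughly as follows. This is Python rather than the proposed C++ subset, and every name in it (Signal, Simulator, always, the unit delays) is an assumption made for illustration, not the actual design.

```python
import heapq
from itertools import count

class Signal:
    """A live port: writes are scheduled events, not immediate stores."""
    def __init__(self, sim, name, value=0):
        self.sim, self.name, self.value = sim, name, value
        self.listeners = []  # processes sensitive to this signal

    def set(self, value, delay=1):
        self.sim.schedule(delay, self, value)

class Simulator:
    def __init__(self):
        self.now = 0
        self._queue = []     # the "time wheel", kept as a min-heap
        self._seq = count()  # tie-breaker so equal times keep FIFO order

    def schedule(self, delay, signal, value):
        heapq.heappush(self._queue,
                       (self.now + delay, next(self._seq), signal, value))

    def always(self, *signals):
        """Decorator: rerun a process whenever one of `signals` changes."""
        def register(proc):
            for s in signals:
                s.listeners.append(proc)
            return proc
        return register

    def run(self, until):
        while self._queue and self._queue[0][0] <= until:
            self.now, _, signal, value = heapq.heappop(self._queue)
            if signal.value != value:      # only real changes wake processes
                signal.value = value
                for proc in signal.listeners:
                    proc()

sim = Simulator()
a, b, y = Signal(sim, "a"), Signal(sim, "b"), Signal(sim, "y")

@sim.always(a, b)
def and_gate():
    y.set(a.value & b.value, delay=1)  # one time unit of gate delay

trace = []

@sim.always(y)
def monitor():
    trace.append((sim.now, y.value))

a.set(1, delay=1)
b.set(1, delay=2)
a.set(0, delay=5)
sim.run(until=10)
```

Processes here are plain callbacks woken by signal changes. In the hardware-scheduled case described above, the heap and the run() loop are the part that would live in silicon.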

snipping

That PCI bus is way too slow to be of much use except for problems that
do a lot of compute on relatively little data, but then you could use
distributed computing instead. PCIe will be better but then again you
have to deal with new PCIe interfaces or using a bridge chip if you are
building one.

Certainly there are classes of problems which require very little data
transfer between FPGA and CPU that could work acceptably even in a PCI
environment.


The real money I think is in the problem space where the data rates are
enormous with modest processing between data points such as
bioinformatics. If you have lots of operations on little data, you can
do better with distributed computing and clusters.
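
A back-of-envelope way to see where plain PCI stops paying off, as a rough Python sketch. The only hard number is the 133 MB/s theoretical peak of 32-bit, 33 MHz PCI; the workload figures below are invented for illustration, and real sustained PCI throughput is lower than the peak.

```python
# 32-bit * 33.33 MHz = about 133 MB/s theoretical peak for classic PCI.
PCI_PEAK_BYTES_PER_S = 133e6

def offload_speedup(bytes_moved, cpu_seconds, fpga_seconds,
                    bus_bytes_per_s=PCI_PEAK_BYTES_PER_S):
    """Speedup of 'ship data over the bus and compute on the FPGA' versus
    'compute on the CPU'. Assumes transfer and compute do not overlap."""
    transfer_seconds = bytes_moved / bus_bytes_per_s
    return cpu_seconds / (transfer_seconds + fpga_seconds)

# Hypothetical job: 10 s on the CPU, 0.1 s on the FPGA.
compute_heavy = offload_speedup(bytes_moved=1e6, cpu_seconds=10.0, fpga_seconds=0.1)
data_heavy = offload_speedup(bytes_moved=1e9, cpu_seconds=10.0, fpga_seconds=0.1)
```

With 1 MB to move the bus time is negligible and the FPGA's advantage survives almost intact; with 1 GB to move the same job is dominated by the transfer, which is the case where a faster link than PCI is needed.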



snipping


One wonders how different history might be now if instead of the serial
Von Neumann architectures (that are now ubiquitous) we had instead
started out with, say, cellular automata-like architectures. CAs
are one computing architecture that is perfectly suited to the
parallelism of FPGAs. (there are others like neural nets and their
derivatives). Our thinking is limited by our 'legos', is it not?
If all you know is a general purpose serial CPU then everything starts
looking very serial.
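
As a small illustration of why CAs map so naturally onto parallel hardware, here is a sketch of one step of a one-dimensional, two-state cellular automaton in Python. The choice of Rule 110 and the wrap-around edges are arbitrary assumptions for the example.

```python
def step(cells, rule=110):
    """Advance an elementary cellular automaton by one generation.

    Every new cell depends only on the OLD values of itself and its two
    neighbours, so all cells could be updated at the same instant.
    Edges wrap around (an assumption made for this sketch).
    """
    n = len(cells)
    return [
        # The 3-bit neighbourhood indexes into the 8-bit rule number.
        (rule >> (cells[(i - 1) % n] << 2 | cells[i] << 1 | cells[(i + 1) % n])) & 1
        for i in range(n)
    ]

generation0 = [0, 0, 0, 1, 0, 0, 0, 0]
generation1 = step(generation0)
generation2 = step(generation1)
```

The list comprehension visits cells one after another only because this is serial software. In an FPGA each cell would be roughly one flip-flop plus one 3-input lookup table, all clocked together.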


I was just reading up on the man, a far bigger "giant" in history than
the serial Von Neumann computer gives him credit for, which I never knew,
to my shame. The legacy stands because the WWII era didn't have too
many tubes to play with, so serial was the only practical way.

(if I recall correctly, before he died Von Neumann himself was looking
into things like CAs and NNs because he wanted more of a parallel architecture)

There are classes of biologically inspired algorithms like GAs, ant
colony optimization, particle swarm optimization, etc. which could greatly
benefit from being mapped into FPGAs.

Phil

Indeed

John Jakson
transputer guy

Transputers & FPGAs two sides of the same process coin



