Re: Another transputer-inspired language?





James Harris wrote:
"JJ" <johnjakson@xxxxxxxxx> wrote in message
news:1139736029.946325.10390@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

snip
Please let's not reinvent something that has already been done well for
30-40 years but is largely unknown on the software side of the
fence.

snip

Before I continue, it's worth keeping an eye on comp.arch.fpga too for
the discussions on the same theme, mostly from the FPGA designer's point
of view, using various xyzC languages and Verilog or VHDL to express
systems that include both software-like and hardware-like components.
Strict HDLs are totally static since they were originally built to
describe permanent ASIC/VLSI hardware devices. FPGAs can also be fully
or partially reconfigured to allow "multi programming" as in an OS, but
the HDLs don't support this at all so far. In fact it's getting worse
since the bit file formats are very proprietary and hard to get much
info on; there was/is JavaBits. The days of open EDA tools for FPGAs
have gone, since one wants to take advantage of the mega blocks
built into FPGAs such as BRAMs, multipliers and SRL16s, and only the
synthesis tools from the vendors have this detailed knowledge, but they
only support Verilog & VHDL.

Ironically the people driving the future of Verilog/VHDL are sort of on
a sinking ship, i.e. the number of new ASICs being designed is rapidly
shrinking while the number of FPGA starts is exploding due to lowish
upfront NRE costs. Perhaps the FPGA people need to have more input into
language direction to allow for reconfiguration and mixed
hardware/software ideas.


What happens when you can put 1000 or 10K PE nodes, assuming the heat &
power can be managed? FPGAs are following Moore's law very nicely indeed
when it comes to packing density, as they are most like memory arrays,
but the cycle speeds are only going up slowly.

The very simplistic way I see it is that:

Some problems are inherently parallel and require not just arrays of
like nodes but communication channels to pass values between the nodes.
Even if massively parallel systems can be fabricated and cooled etc I
wonder if the communication system would end up being either too rigid
or too slow or both. Too rigid if it only accommodates fixed patterns of
node connections, thus leading to only a few nodes being usable for a
given problem. Too slow if comms has to go via blocking queues or other
nodes. Perhaps a crossbar switch could be implemented but wouldn't it
take up a serious slice of the silicon?


True, again pure HDL languages have no concept of new or delete for an
instance of hardware. When I suggest the V++ language be modeled after
Verilog but using C syntax, I also allow for vars of type process xyz
which can then be allocated and deleted on demand. A process instance
that is just named in the usual Verilog sense is instanced the moment
the parent object is loaded into the FPGA and stays until power down.
But a function that runs in a thread on a Transputer node may be
allowed to instance a block at any time. In the code sense this is just
like forking and having the kernel add new threads to the schedule
list. In the hardware sense it also means passing the instanced module
back through synthesis, place & route, and finding a place in the FPGA
to put it, and that brings up swapping, OSes etc. I am not sure how
far we can go with this, since dynamically loaded modules of the
same instance type usually should be copies of each other, but that may
not be possible and may give different timings. So this would be a real
memory/fabric system, not a virtual one.

FPGAs are very good at crossbars by definition since wires are almost
free. They do have considerable delays and impose wiring restrictions,
so a massive Transputer array might use nearest neighbour links and/or
additional over wiring to shorten long message hops, but these may well
complicate placement; it all depends on how much wiring is left unused
by the generic design. FPGAs intrinsically include nano, micro and milli
wiring lengths for just this reason. Obviously the longer wire nets are
scarcer resources than the shorter wire nets.


Some problems are not parallel at all and have to be dealt with
sequentially.


And that means that at least 1 processor node has to be in the hardware
fabric to run seq software and manage the hardware resources, as
well as any runtime loader needed to place presynthesized hardware.

Occam makes seq/par mixing so easy; most languages are almost entirely
1 or the other. Verilog/VHDL do allow seq functions too, but these
are not usually synthesizable to hardware; they run during design
testing and verification in simulation to test other blocks intended to
be hardware. These seq parts of the language often have hooks to allow
external C calls but look clumsy to me (PLI interface etc).

There is a middle ground where the task can be divided to 'micro'
processes and then, perhaps, process-farmed to nodes. The component
processes do not need to be identical.


Yes, the nice thing about putting 1 or very many Transputers into an FPGA
is that you can select at any time whether to run code as hardware by
synthesis or as code by running on the Transputer scheduler, if only
there was a language suitable for both sides. Occam is too unknown on
the software side and not really useful for synthesis even though
ESL/Celoxica did just that. Now perhaps one of the xyzC languages might
be good enough (FPGAC, StreamC, Handel-C etc etc) but I fear not,
although I will have to check them out. Also the language must be in
the public domain, which rules out most of them. These languages were
concocted to allow C programmers to describe hardware problems in C
syntax without getting into the gritty details of low level hardware
design, and I believe they often perform badly on clock frequency and
resources used. They promise rapid turnaround, but that is not
really enough to cover FPGA costs.

The functions of processing and communication are orthogonal to each
other and are fundamental to the art/science of computing. Occam had the
beautifully simple in and out constructs but didn't provide buffering.
IIRC this allowed mathematical proof of correctness and absence of
deadlocks. However, it seems that the real world requires more
flexibility. The problems that brings may be difficult to solve but once
solved the solutions can be applied to similar patterns.


Hardware languages are also formally manipulated by synthesis but don't
have the CSP academics to support them. There is a lot of similar
abstract math involved in mapping HDL expressions into optimal logic;
it's just not much known about outside the synthesis community. De Morgan's
rules for and_or seem similar to seq_par transformations.

Also the hardware people are plenty comfortable with metastability,
sampling, multiple clock domains, buffers etc and you don't really hear
about deadlocks until software is involved. EEs know all about MTTF and
the kind of testing that allows hard numbers to be put into specs, the
likes of which software will never see.

As for how the processes are placed on nodes it seems that there are
some options. The key is dividing the problem into small processes. Once
a problem has been thus subdivided the subprocesses can run in parallel
on different nodes, they can run timeshared on one node, or can be a
mixture of the two. They can even be built into the same address space
if the communication is implemented by subroutine calls. As with one of
the sweetest features of the Transputer, the application code can be
identical. Only the implementation of the communication part needs to
differ.
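
To make the "identical application code" point concrete, here is a
minimal Python sketch. The names QueueChannel, CallChannel and producer
are made up for illustration and are not from occam or any Transputer
toolchain. The producer is written once against a send() interface; one
channel implementation hands values to another thread through a queue,
the other turns each send into a plain subroutine call in the same
address space.

```python
import queue
import threading


class QueueChannel:
    """Channel between concurrently running processes (threads here)."""

    def __init__(self):
        self._q = queue.Queue()

    def send(self, value):
        self._q.put(value)

    def recv(self):
        return self._q.get()


class CallChannel:
    """Channel inside one address space: send is just a subroutine call."""

    def __init__(self, consumer):
        self._consumer = consumer

    def send(self, value):
        self._consumer(value)


def producer(out, n):
    """Application code: identical whichever channel it is given."""
    for i in range(n):
        out.send(i * i)
    out.send(None)  # end-of-stream marker, a convention assumed for this sketch


def run_with_calls(n):
    results = []

    def consume(value):
        if value is not None:
            results.append(value)

    producer(CallChannel(consume), n)
    return results


def run_with_threads(n):
    chan = QueueChannel()
    results = []

    def consumer():
        while True:
            value = chan.recv()
            if value is None:
                break
            results.append(value)

    worker = threading.Thread(target=consumer)
    worker.start()
    producer(chan, n)
    worker.join()
    return results
```

Note that queue.Queue buffers, so this is looser than occam's unbuffered
rendezvous; it only shows that the placement decision can live entirely
in the channel implementation.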


The method of process placement is handled in much the same way ASICs
use floor planning and cell placement with current EDA tools. Processes
are either manually or automatically placed; the goal is to reduce
the comms cost or signal delays while not congesting the work load onto
1 node. This might involve auto repartitioning of the process
hierarchy. If it looks like static Verilog, it's an already solved
problem.

The model of communication I propose in the Transputer I describe in
the paper is to use the MMU for both local and neighbour storage
requests. Every object has a name of 32 bits or so, which indexes a
32-bit space through a hash function onto the local store. The name
includes a Transputer-MMU node ID as well as an object ID. So a 256 way
setup would use 8 bits for the node and 24 for the local object space.
That means that all processes on any Transputer see a flat object index
space but have immediate fast access to their own memory on their MMU
and relatively slower access to objects far away. The MMU handles local
memory requests as normal but passes off-node accesses as messages
through adjacent links, wormhole fashion, with some expected latency,
while stalling the PE that accessed the data. The Transputer is now
really defined mostly by the MMU, since the local processor elements
mainly execute simple register codes. In fact the PE instruction set
doesn't even matter; it could be ARM, x86 or a processor specially
optimized for FPGA. Occam like codes will usually involve the MMU since
scheduling and communication involve block moves.
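
A minimal Python sketch of the 8/24 split for the 256 way case. The
text above only gives the field widths; putting the node ID in the top
bits, and the helper names make_name, split_name and is_local, are
assumptions made for illustration.

```python
NODE_BITS = 8     # 256 way setup
OBJECT_BITS = 24  # local object space on each node


def make_name(node_id, object_id):
    """Pack a node ID and a local object ID into one 32-bit name."""
    if not 0 <= node_id < (1 << NODE_BITS):
        raise ValueError("node_id out of range")
    if not 0 <= object_id < (1 << OBJECT_BITS):
        raise ValueError("object_id out of range")
    # Assumed layout: node ID in the high bits, object ID in the low bits.
    return (node_id << OBJECT_BITS) | object_id


def split_name(name):
    """Return (node_id, object_id) for a 32-bit name."""
    return name >> OBJECT_BITS, name & ((1 << OBJECT_BITS) - 1)


def is_local(name, my_node):
    """True if the MMU on my_node can serve the access without a message."""
    return split_name(name)[0] == my_node
```

With this layout the MMU only has to compare the top 8 bits against its
own node ID to decide between a local access and an off-node message.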

snip
I would like to see a subset of C++ pick up a subset of Verilog so
that structs can grow into classes and then into processes that can be
logically arranged as easily as Verilog modules can. Instead of

module mname ( <port list decl> ) beginmodule ... endmodule

I would just use

process pname (<port list decl>) { ... }

in the usual C style. It's the port list that enables true massive
concurrency, since event driven signals can now be routed statically
around a process hierarchy, and it provides liveness.

You mean 'ports' as communication endpoints - so as long as a sender can
address a port then that port can be implemented/received anywhere?


Verilog encourages instances to be wired to neighbours as much as
possible. Signals may propagate through a parent's port list and so on
up the hierarchy and back down, but only as expressed through the
wiring. Tri state or multiplexed busses carry traffic in preplanned
paths. Verilog also allows the source to wire pt to pt directly through
the hierarchy without having to drill the path through all the nesting,
by specifying paths just like a postal address using the dot separator.
Anyway the source has its hierarchy smashed after parsing so the result
is the same. Mixing this with dynamic instancing of modules or
processes would perhaps mean that new'ed instances can only be attached
to existing bus wiring, adding more loads onto master busses.

The ... block content would also include the
continuous and always assignments as well as variable and signal
declarations and also process instances, as well as the familiar C++
methods and some of the C++ OO ideas. I am not sure how much OO is
relevant, but localising variables, methods and processes is needed;
dump templates, exceptions, multiple inheritance, etc. Please take a
look at Verilog to see what I mean.

OK. I'll take a look at it. I have one book on order. If someone can
recommend a simple book of Verilog concepts (i.e. suitable for the
simple software engineer rather than someone with a wealth of hardware
experience) I would appreciate it.


Actually just google for some Verilog/VHDL tutorials, there are tons of
them out there; Sutherland and a few other names stand out. For books,
Verilog is far less common on the bookshelf than VHDL since it's taken
up more by EEs than by students. VHDL is more akin to Ada. Thomas &
Moorby and Palnitkar are good.

Also visit Xilinx or Altera for Verilog tutorials; they pretty much
tell you how to write code (V/V) to get the device structure you want,
something that the xyzC guys can never do so easily as outsiders.

When I think of how to use 1000s of Transputer nodes, I see no
practical alternative to Verilog + C+ or V++ as I call it. Right in
the early Inmos literature, it said occam processes model hardware, so
naturally a hardware description language with general programming
extensions is the right way to go. I am not quite so sure how to
handle dynamic process creation, as this is a lot like self modifying
hardware. FPGAs have enabled this for some time but it has seen
little use so far.

If I follow..., I guess the problem is where to place new processes,
which depends on their computation requirements and what
bandwidth/latency they need to their communication partners.

In hardware design and also in occam, it's usually pretty obvious to the
design engineer how to partition a problem into sub blocks and 'wire'
them up. In occam the code sort of runs as you write it but might get
some seq-par transforming too, to minimize par when seq is equivalent
and so reduce the overhead of scheduling. In HDLs the code is always
hierarchy smashed, so it doesn't matter to the EDA tool flow how the
hierarchy is composed. The code is then synthesized, optimised and
placed & routed in multiple passes until it meets a timing and area
design constraint, or not.

Does the process model /have/ to model hardware or not? Cannot we just
say that the process model /can/ model hardware? And it can also model
processes that have nothing whatever to do with hardware.


Precisely. One can use plain old C to do generic seq software, and I and
others also use it to describe plain old par hardware in a style called
register transfer level (RTL) that really boils down to 2 sets of
storage, set A and B. B <= func(A) alternates with A <= func(B), with
the same func on both odd and even clocks. Another way is to use 2
funcs, 1 for the logic and a dumb func simply to copy B back into A, so
the code looks like A <= func(A) with B hiding behind as a hidden
buffer, i.e. master slave. In Verilog the B slave is totally hidden so
we only write lots of these <= clocked assignments. A would then be all
the register bits in the entire chip. The <= means simultaneously assign
the evaluated values of the right sides to the left sides. So x <= x+1
is a counter etc.
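
A minimal Python sketch of the second form, assuming a made-up design
with just two registers, x and y (the names next_state, clock and run
are illustrative only). All next values are computed from the old set A
into a hidden set B, and only then copied back, which is what makes the
assignments behave as if simultaneous.

```python
def next_state(a):
    """The logic func: compute set B from the current register set A."""
    return {
        "x": (a["x"] + 1) & 0xF,  # x <= x + 1, a 4-bit counter
        "y": a["x"],              # y <= x, which sees the OLD value of x
    }


def clock(a):
    """One clock edge: B <= func(A), then the dumb copy of B back into A."""
    b = next_state(a)  # every right side is evaluated before any update
    return dict(b)     # copy B back into A


def run(a, cycles):
    """Apply the given number of clock edges and return the final registers."""
    for _ in range(cycles):
        a = clock(a)
    return a
```

Because y is computed from the old A, it always lags x by one clock, as
it would in hardware; a naive in-place update would make y equal to the
new x instead.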

The continuous form looks much more like C,
assign a = ... b.., c = ... x y z , etc. Here all the right sides are
individually evaluated as and whenever any of the inputs changes, so
the ordering of the expressions is unimportant. That's the thing about
HDLs and also occam par: they force you to let go of unimportant
sequencing in code that has no value.
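
A minimal Python sketch of why the ordering is unimportant. The signals
a, b, c, x, y, z and the settle helper are illustrative, and the loop
assumes there are no combinational feedback loops. A real simulator is
event driven rather than sweeping every expression, but the settled
result is the same.

```python
def settle(inputs, assigns):
    """Re-evaluate every assignment until no signal changes."""
    signals = dict(inputs)
    for name in assigns:
        signals.setdefault(name, 0)  # undriven wires start at 0 in this sketch
    changed = True
    while changed:
        changed = False
        for name, func in assigns.items():
            value = func(signals)
            if signals[name] != value:
                signals[name] = value
                changed = True
    return signals


# assign a = b ^ c, b = x & y, c = y | z, written in one order...
assigns_one = {
    "a": lambda s: s["b"] ^ s["c"],
    "b": lambda s: s["x"] & s["y"],
    "c": lambda s: s["y"] | s["z"],
}
# ...and the same three assignments in the reverse order.
assigns_two = dict(reversed(list(assigns_one.items())))

inputs = {"x": 1, "y": 0, "z": 1}
result_one = settle(inputs, assigns_one)
result_two = settle(inputs, assigns_two)
```

Both orderings settle to the same signal values; only the number of
sweeps needed differs.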

One could also use Verilog or VHDL to write software as well as
hardware, but they are very poor for software, they have no datatypes,
structuring etc.

By combining the useful event driven model from Verilog and the basics
of C++ we can have a language that can describe seq & par processes at
will, and is close enough to Verilog to synthesize but close enough to C
to run as code with the provided event driven timing wheel, which is
included anyway with a Transputer core. Now it does mean an algorithm
would initially be explored in strictly seq C style with little chance
to synthesize as hardware, and gradually get rewritten using the RTL
style and Verilog operators to be synthesizable. While doing this RTL
code transform, the code will start to run a bit slower as more use of
the scheduler is made, but the final RTL should run much faster than
most Verilog simulators since we can accept some simplifications like
single clock domains.

The other xyzC languages though would try to eliminate that RTL step
and let you write only the behavioural code while they figure out how to
do the RTL step. Software guys might like that a lot; hardware guys
don't, since the tool may be doing what they were trained to do.
Software guys might argue it's the same as the assembler wars of the
past, but the hardware guys always want detailed control for performance.

The reason I consider Transputing to be so relevant today is that it is
simply the flip side of FPGAs: one runs processes on sequential CPUs,
the other runs them on real (arguably) hardware, so why not be able to
use the same language to describe algorithmic processes on hardware &
software? Better still, why can't processes be described in a common
language that can then be partially synthesized at will into hardware
if need be: proto in V++ code and run as code, then tune and move into
HW as needed.

Is that not, at least partially, the same as my initial point? Build the
system as software, test correctness, profile (don't guess) to identify
areas requiring speedup, then map to hardware and repeat from the
profiling step.


Yes, same thing, although the hardware guys usually want to drive
the mapping in detail by writing the RTL directly in a language that
can be synthesized without performance surprises. The software guys
would like it to be easier, but far less work means far less
performance; no free lunch.

One of the reasons for actually caring about performance is that it is
quite easy to lose performance if you drive tools from a great
distance, such that just running the code on a P4 at 3GHz on a single
thread may actually be faster and cheaper than a massively parallel
design at 50MHz. Hardware pros can reach 150-300MHz for highly tuned
designs, which may be just enough to overcome the high price of using a
hardware accelerator over a plain old P4. The problem with FPGA
acceleration is that FPGAs are at least 5-25x slower than the logic in
your P4 but make up for it by massive parallelism. But for the same
amount of silicon and $, FPGAs only give you maybe a 20th of the logic
density of real VLSI, so 20x25 is a huge penalty to overcome. I have
seen papers on Mersenne prime acceleration where they came out even
because the design actually ran at about 50MHz but was done by software
people or mathematicians; a lot was left on the table. Despite this the
P4 has hit the wall and newer Pentium Ms are now playing by the same
rules as FPGAs, maximizing core count rather than brute clock speed, so
the race is now somewhat synchronized.
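
Spelling out the 20x25 arithmetic as a small Python sketch. The
function name is made up, and multiplying the two ratios assumes the
speed and density penalties are independent, which is a simplification.

```python
def combined_penalty(speed_ratio, density_ratio):
    """Parallelism an FPGA design needs just to break even with VLSI logic."""
    return speed_ratio * density_ratio


DENSITY_RATIO = 20  # "maybe a 20th of the logic density"

best_case = combined_penalty(5, DENSITY_RATIO)    # FPGA logic 5x slower
worst_case = combined_penalty(25, DENSITY_RATIO)  # FPGA logic 25x slower
```

So under these figures the design has to extract somewhere between 100x
and 500x of useful parallelism per unit of silicon before it pulls
ahead.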

Now if we go further and replace a few overly complex CPUs with lots of
very simple cores, the multi Transputer may actually still work out
better than turning code into logic, since the processor is designed
once to run at say 150 MIPS and uses about 500 LUTs. If an algorithm is
mapped onto hardware, runs at 50MHz and also uses 500 LUTs, it
can be compared to the same algorithm coded in 3 opcodes. But there's a
lot to consider besides.
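
The "3 opcodes" figure falls out of the two clock rates. A minimal
Python sketch, assuming equal LUT cost on both sides and one unit of
work per hardware clock (the function names are made up):

```python
def opcode_budget(cpu_mips, hw_mhz):
    """Opcodes the soft processor can issue in one hardware clock period."""
    return cpu_mips / hw_mhz


def processor_keeps_up(opcodes_needed, cpu_mips=150.0, hw_mhz=50.0):
    """True if the code version matches the hardware version's throughput."""
    return opcodes_needed <= opcode_budget(cpu_mips, hw_mhz)


budget = opcode_budget(150.0, 50.0)  # 150 MIPS against a 50MHz block
```

If the per-clock work of the 50MHz block can be coded in 3 opcodes or
fewer, the 500 LUT processor keeps up with the 500 LUT hardware
version; beyond that the dedicated logic wins on throughput.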

In your original post you didn't mention hardware at all, but occam
made the leap into hardware synthesis and so would any of its
successors since processes model hardware.

John
