Re: Chuck's plan



On May 29, 12:26 am, Wayne <news_putmynamehere...@xxxxxxxxxxxxxxx>
wrote:

So many questions. I thought they had all been answered multiple
times in the past. It is sometimes dangerous to answer a dozen
questions when someone asks, because they may turn each answer
into a dozen more questions, and other people often join in
and ask the same questions again. And one can get seriously
misquoted in the process.

I am interested in what in particular stopped it
from having an automated memory bus, rather than the software driven bus?

A decade ago the on-chip architecture was asymmetric. F21
had a Forth CPU, a video I/O coprocessor, an analog I/O
coprocessor, a network router, realtime clock, and a memory
interface coprocessor. If you added memory you could make
a symmetric Forth multiprocessor with multiple nodes. But
the design being asymmetric meant that each part of the
design was different, had to be designed separately, had
to be tested separately and had to be programmed separately.
Furthermore, the memory interface being hard-wired could only
support whatever chip interface was chosen at design time.
This required predicting which memory chips would be most
available and most inexpensive in the future, which isn't
always possible.

A design decision was made a decade ago to go to symmetric
multiprocessing on-chip. Each node would be the same, so
there is only one processor to design, not a half dozen
different ones for every chip designed. The idea was
that things done with dedicated hardware before would be
done with a Forth core and software this way.

On the memory side it means that the interface is more
flexible and can support a wider range of functions.
On the design side it means one design instead of
the thirty different designs Chuck did in the past.

I would have thought that it would not be too much of a strain, as only
one core needed it?

Once a decision is made to use a symmetric design instead of
designing thirty different coprocessor systems then there is
the requirement that all cores be the same. Having one node
be completely different and be like the design from twenty
years earlier would violate design goals and be too much
of a strain by definition.

It is like the group that insisted that a more powerful
router, more independent and using less software for
interprocessor communication between nodes, would
improve performance. People often ask, why not just
add this or that? And you can tell them that the thing
that they are describing makes each core ten times
as big and expensive and power hungry and isn't going
to give a ten times increase in overall performance.

I asked the above separately, because potentially being on one core, it is

It violates the concept of symmetric design with all cores being
the same. Chuck already did thirty or forty different specialized
I/O coprocessor designs and wanted to have a single core design to
replace all that. Once you go down that road for a decade people
will still say that it would be easy to throw it all away and
go back and start going down the old road instead.

It is a little like the suggestion that what needs to be
done is to go back to programmable logic like more than
twenty years ago.

not such a bottleneck to every other core, and can drag in data faster,
however, I am interested in how this played into the design?

I think the idea of the symmetric design has been explained before.

I know you mean each bit takes less than 250MHz, that is a lot lower than
the core frequency, or what intra chip buses now deliver.

Not really. If you service an external 18-bit port or a neighbor port
in a loop and it has a 4ns read and 4ns write and a 2ns loop branch it
is going to transfer 18 bits every 10ns whether it is between neighbors
on-chip or to an external port or device.
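
As a rough check on that arithmetic, here is a minimal Python sketch. The
4ns, 4ns and 2ns figures are the ones quoted above; the function names and
the 64-word block are made up for illustration and come from no real tool.

```python
# Timing figures from the post; all names here are illustrative assumptions.
READ_NS = 4     # port read
WRITE_NS = 4    # port write
BRANCH_NS = 2   # loop branch

def loop_period_ns(read_ns=READ_NS, write_ns=WRITE_NS, branch_ns=BRANCH_NS):
    """Time for one pass of the read/write/branch loop, in nanoseconds."""
    return read_ns + write_ns + branch_ns

def words_per_second(period_ns):
    """One 18-bit word moves per loop pass, so the rate is 1 / period."""
    return 1_000_000_000 // period_ns

def block_transfer_ns(n_words, period_ns):
    """Time to move n_words through the loop, ignoring any setup cost."""
    return n_words * period_ns

period = loop_period_ns()
print(period, "ns per word")
print(words_per_second(period), "words per second")
print(block_transfer_ns(64, period), "ns for a hypothetical 64-word block")
```

The point of the sketch is only that the rate is set by the loop, not by
where the port leads.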

It may sound low compared to a 700MHz core, but the idea is that
the core has the size, cost, and power consumption of chips that
are a hundred times slower. The 10ns 100MHz loop looks pretty fast
compared to other small and cheap designs. You can't get
everything at once; there are tradeoffs. According to
comp.arch.embedded the fastest chips out there can only bit-bang
protocols at 4Mbps maximum and apparently no one else can do
it at 20 to 30Mbps.

I would have
expected that a word could have been dumped to the serial bus (link wakes
up other core with first bit, core software reads port to tos, other bits
automatically follow/streamed across and dumped into tos) and it streamed
across to the receiving core at 700mhz*18, well within the boundaries of
present intra chip busses.

Are you joking? That is what happens with the serdes. You read or
write a word and it gets converted to or from a serial stream by
hardware. But 700MHz * 18 bits would be 12.6GHz. The idea of 12.6GHz
serial connections in .18u is very unrealistic.

The 400Mbps speed of the initial serdes is an order of magnitude
faster than bit-bang in software and should be stepped up to 480
as in USB or 1000 as in Ethernet, but don't expect to see it
stepped up to 12,600Mbps any time soon.
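
To put numbers on the gap, a small Python sketch follows. The 700MHz,
18-bit and 400Mbps figures are from this thread; the function name is just
an illustration.

```python
# Figures quoted in the thread; the helper name is an illustrative assumption.
CORE_HZ = 700_000_000      # core rate mentioned in the question
WORD_BITS = 18             # word width
SERDES_BPS = 400_000_000   # initial serdes speed

def required_serial_bps(core_hz, word_bits):
    """Serial bit rate needed to stream one full word per core cycle."""
    return core_hz * word_bits

required = required_serial_bps(CORE_HZ, WORD_BITS)
print(required / 1e9, "Gbit/s needed")
print(required / SERDES_BPS, "times the initial serdes rate")
```

Streaming a word per core cycle over one serial line would need a link
about thirty times faster than the initial serdes.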

I know there was discussion on conflicting bus request to one core,
something I just assumed would have been solved under the system. I put
forward a simple and elegant solution. The core is woken up and receives
(it does this so often as a root function of the chip maybe there should
be a special instruction) the pattern notifying it which lines have data.
The suggestion is this, that it receives notification which it handles,
but all new lines that become active during notification and processing
set off a new wake up and notification the next time it is to sleep (in
other words when it checks, before it hits sleep state, it sees that there
are additional notifications and then continues on without entering sleep
state, and any new lines then are deferred to the next attempt to read the
port) (now I can't remember how it was designed) and other cores are not
woken up as having received the data until the deferred notification goes
through. In this way (I think I remember correctly) no notification can
ever be missed except through software design fault, and the system
robustly handles overlapping requests. Am I right in my naive assumption
here Jeff, and does this sound any good?
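
For readers trying to follow the proposal, here is a toy software model of
it in Python. It is only a sketch of the idea as worded above, not of any
real chip logic: the class, the method names and the four-line port count
are all assumptions made for illustration.

```python
class PortNotifier:
    """Hypothetical model: a latch of lines that have data not yet reported."""

    def __init__(self, n_lines=4):
        self.n_lines = n_lines
        self.pending = 0  # bit i set means line i has unreported data

    def signal(self, line):
        """A neighbor raises a line; the request is latched, never dropped."""
        if not 0 <= line < self.n_lines:
            raise ValueError("no such line")
        self.pending |= 1 << line

    def take_pattern(self):
        """Hand the core the current pattern and clear the latch."""
        pattern, self.pending = self.pending, 0
        return pattern

    def may_sleep(self):
        """The core checks this before sleeping; late arrivals keep it awake."""
        return self.pending == 0


def service(notifier, arrivals_during_processing):
    """Handle patterns until nothing is pending; return the patterns seen."""
    arrivals = list(arrivals_during_processing)
    handled = []
    while not notifier.may_sleep():
        handled.append(notifier.take_pattern())
        # Lines that become active while the core is busy are deferred
        # to the next pattern instead of being lost.
        if arrivals:
            for line in arrivals.pop(0):
                notifier.signal(line)
    return handled


n = PortNotifier()
n.signal(0)
n.signal(2)
print(service(n, [[1], [3, 0]]))
```

In this model lines 0 and 2 are handled first, and the lines raised during
processing show up in later patterns before the core is allowed to sleep.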

It was looked at years ago.

10ns loop in software.

Ok, thanks for that. My goodness, that is slow, even for software, that is
10Mhz..

No, 10ns is 100MHz, not 10MHz. ;-)

I know that zeros confound some programmers but you seem to have
missed one. I would remind you that we have had discussions in
c.l.f about bit-banging a square wave on a 60MHz ARM at 200KHz
with an interpreted threaded Forth in Flash, at 2MHz using
an optimizing "C" compiler, and at 4MHz using an optimizing
native code Forth compiler for ARM. The difference between
4 and 10 is big but the difference between 4 and 100 is more.

You have demoted the Forth chip down to 10MHz from 100MHz.
10MHz is still faster than most stuff but you have missed
one of those pesky zeros. You keep making public statements
off by one order of magnitude on this. Answering usenet
questions can be like a tar baby.
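
For anyone who wants to check the zeros, a short Python sketch of the
conversion follows. The 10ns loop and the 200KHz, 2MHz and 4MHz bit-bang
figures are the ones mentioned above; the names are illustrative, and the
ratios simply follow the loose comparison made in this post.

```python
def period_ns_to_khz(period_ns):
    """Convert a loop period in nanoseconds to a rate in kHz."""
    return 1_000_000 // period_ns

# Rates quoted in the post, in kHz, so the arithmetic stays in integers.
ARM_BIT_BANG_KHZ = {
    "interpreted threaded Forth": 200,
    "optimizing C compiler": 2_000,
    "native code Forth compiler": 4_000,
}

loop_khz = period_ns_to_khz(10)
print("10ns loop =", loop_khz // 1000, "MHz")
print("100ns loop =", period_ns_to_khz(100) // 1000, "MHz")
for name, khz in ARM_BIT_BANG_KHZ.items():
    print(name, "is slower by a factor of", loop_khz // khz)
```

A 10MHz loop would have a 100ns period, which is the missing zero.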

So, you are saying sram would be restricted to 10Mhz, not 100 or
200Mhz?

No, that's not what I said at all. I said that the fastest you can
stream a parallel port is 100MHz at 18 bits, or 1.8Gbps, and
with two ports this is a maximum of 3.6Gbps or 200Mwps. I also
explained in detail that that is not the signal fed by memory
chips.
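
The port figures can be checked with a couple of lines of Python. This is
a sketch of the peak streaming numbers only; the function name is an
assumption, and it ignores the address and control signals a real memory
chip needs.

```python
WORD_BITS = 18  # port width from the post

def port_throughput(word_rate_hz, n_ports=1, word_bits=WORD_BITS):
    """Return (words per second, bits per second) for n_ports streaming flat out."""
    words = word_rate_hz * n_ports
    return words, words * word_bits

print(port_throughput(100_000_000))             # one port
print(port_throughput(100_000_000, n_ports=2))  # two ports
```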

Memory chips have address bus and control bus signals in addition
to their databus signals. When you factor those things in you drop
below 100MHz. But no, a 10ns loop is not running at 10MHz! :-(

I should point out, that in my previous 18bit serial suggestions, you
would have one line automatically dumping a 18bit word to bus at each
cycle, the top of the bus might contain the pattern to tell which it was
coming from, and the second the value, still the next four memory
locations could have four memory locations to jump to (and numerous other
ways to handle it) the easiest way (but not for the stack) would be dump
the pattern then the values of active cores that are current on the stack,
but that is up to 5 stack words gone and some buffering internal registers
to parallel feed from the stack. I can see the point, that if all four
lines were to come in serially one bit at a time, you could parallel
handle 4 streams, but you still have software overheads compared to a
word at a time. If you had a system where the cores would accept the four
lines at any time in parallel, into four 18 bit registers, and every
attempt to read which is current transfers the current ones onto the
stack with a pattern to tell where they come from, then the processor only
sleeps if there is no communications current, but for the cost of four
registers and some extra circuitry (I am not suggesting a complex
automated system, but only the most simple working with existing serial, +
one control serial/parallel line). But to get around the increased stack
depth problem, the current could be read round robin style, where the next
current line in the series is returned next with a tos pattern to tell
which one (this is a previous scheme of mine where it stops a port high
up in the series from getting preference and hogging the processing time)
this might require some circuitry or a 4 bit/value register.
Alternatively you could receive, a wake up and pattern, and read a memory
mapped 18 bit virtual port, to read in 18bits of that serial line
(through notification on the serial/parallel control line). There are
other ways to handle this, but it would be obvious to Chuck from this.
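
The round robin part of the suggestion is the easiest piece to pin down,
so here is a toy Python model of it. It is a software illustration only,
not a circuit: the function names, the bit mask encoding and the four-line
count are assumptions.

```python
def next_line_round_robin(active_mask, last_served, n_lines=4):
    """Return the next active line after last_served, wrapping around.

    active_mask has bit i set when line i currently has data.
    Returns None when no line is active.
    """
    for offset in range(1, n_lines + 1):
        line = (last_served + offset) % n_lines
        if active_mask & (1 << line):
            return line
    return None


def serve_order(active_mask, n_reads, n_lines=4):
    """Order in which lines are read if the same lines stay active."""
    order = []
    last = -1  # so that line 0 is considered first
    for _ in range(n_reads):
        line = next_line_round_robin(active_mask, last, n_lines)
        if line is None:
            break
        order.append(line)
        last = line
    return order


print(serve_order(0b1001, 4))  # lines 0 and 3 stay busy and take turns
```

Because the search always starts just past the line served last, a busy
line cannot hog the reads in this model.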

There are literally thousands of design constraints that one has to
be compatible with when making any change. The above paragraph would
need a couple of man months of discussion to flesh out into anything
real. If someone can sit down and design a circuit in a month or
two and show Chuck how it works he is happy. If someone expects him
to stop doing his work and discuss the design changes that they
think might be possible for a long time he is not so interested.
You quickly learn that he has usually already thought through
what seems like a new idea to you. After a few hours of explanation
about how other things work and about related constraints in other
parts of the design, you realize that Chuck considered it
long ago.

Some nodes have to service the external bus and others can
execute the stream. You normally only read a port with a data
read and can only call or jump to an external port being used
as a memory bus and execute the code there to go somewhere
else because a processor has to manage this external port. The
external address bus incrementation, external control signals,
and next read of that port are not going to be driven by
an automatic hardware memory interface.

Which is exactly what I knew, and assumed a take down on performance from
that, but 10Mhz?

As I say, it is a bit dangerous to answer a dozen technical questions.
Even if you give several paragraphs of detailed explanations it will
often lead to things like a 10ns loop being repeatedly called a 10MHz
loop in follow up posts. And each answer when spun like that
generates not only more questions but more confusion for other
people.

So, a single chip can't do all the handshaking, I remember now (thinking
wrong sort of memory)?

A single node (not chip) on the 40 core can't do all the handshaking.
If one core supports the data bus and another core supports the address
bus and another core supports the control bus, as on the 40 core, it
takes three nodes. On s24, core 00 had all three external busses as
left, up, and IOCS ports. It could do all the handshaking but
could not do it as fast as three nodes in parallel. Some other node
can execute the streams read from external memory, but not the
nodes that are servicing the external ports.

Otherwise this is pretty much what I expected.
What John was talking about, is me preferring an automated memory interface
for code execution (but serial instead of the parallel I wanted) which I
knew wasn't on present chips.

I have to admit that I find it pretty funny that many people who didn't
get chips like F21 a decade ago now don't get more modern chip designs
and say that now they are starting to understand the old designs.

What doesn't stand to reason is how that helps the REAL problem for
running Java on modern MISC.

Optimizing a Java implementation on a chip like F21 or i21 with
megabytes of addressable external memory and stacks but no local
RAM is fairly straightforward. Mapping it to tiny parallel
processors is a very different problem. One might be able to
make a virtual Java processor with a bunch of small Forth
cores but it is a problem I haven't looked into.

I can answer this one, it is a 'problem', and not for the faint hearted,
but definitely with more than 10Mhz external performance to be practical
compared to the 66mhz+ Arms out there.

You make me regret giving you the detailed 10ns explanation when you
then turn it into repeated 10MHz public statements. Yikes!

MISC went in the other direction. From the 32-bit bus ShBoom things
went to a 21-bit address memory bus. But it only had 10-bit pages
because, as Chuck had written in the first Forth paper in 1968, "Most
Forth applications fit in 1K."
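
For scale, the address arithmetic can be sketched in Python. The 21-bit and
10-bit figures are from the paragraph above; treating the page bits as the
low bits of the address is my assumption for the sketch, and the counts are
in addressable units, whatever their width.

```python
ADDRESS_BITS = 21  # address width from the post
PAGE_BITS = 10     # page width from the post

addressable_units = 1 << ADDRESS_BITS            # everything the bus can reach
units_per_page = 1 << PAGE_BITS                  # the "1K" in the quote
page_count = 1 << (ADDRESS_BITS - PAGE_BITS)     # assumes pages tile the space

print(addressable_units, "addressable units")
print(units_per_page, "units per page")
print(page_count, "pages")
```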

I would just like to add to your discussion, that somebody once said that
68K was more than enough for any computer program, and another suggestion
640K was enough :) .

Sure but that was before Microsoft became a large investor in Intel.
Marketing was able to use the increased hardware/software requirements
to sell more product. But it is very different than the idea
behind Forth chips.

I found it funny that people used to Windows insisted that
a 4megabyte address space was not big enough to do anything
despite most applications being about 1K in size.

If you said that most Forth applications fit in a few K
Forth words, people tended not to wince. If you said that
Forth words were mostly five-bit opcodes, instead of
thirty-two-bit calls to words made of multiple thirty-two-bit
words, that few K became a K or so. I found it funny that so
many people argued that it wasn't true in the nineties.
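
A back of the envelope Python sketch of that size argument follows. It
assumes every operation is either one five-bit opcode or one thirty-two-bit
cell, which is a simplification, and the 4096 operation count is a made up
example of "a few K".

```python
def code_bits(n_ops, bits_per_op):
    """Total size in bits if every operation takes bits_per_op bits."""
    return n_ops * bits_per_op

def size_ratio(wide_bits=32, narrow_bits=5):
    """How many times smaller the narrow encoding is."""
    return wide_bits / narrow_bits

n_ops = 4096  # hypothetical "few K" of Forth words
print(code_bits(n_ops, 32) // 8, "bytes as 32-bit cells")
print(code_bits(n_ops, 5) // 8, "bytes as 5-bit opcodes")
print(size_ratio(), "times smaller")
```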

But Chuck's saying that most Forth applications fit in
1K wasn't anything new. I found the statement in the
1968 Forth paper, although there it was about how
the source fit in one block. After forty years of
work to improve the density of the software down
to replacing a page of old Forth code with a
five-bit opcode Chuck's comments about code size
have a context.

And I have to admit that I found it funny that, after
all the people who made fun of 1K Forth code, Chuck
proposed making something with 64 words because
it seemed big enough to Chuck and to people used to
very dense Forth code.

Integrated mixed chips. Misc: Control core with large memory + array DSP
cores, + custom cores and bits and pieces, anything can be turned on or
off at any time.

That sounds very conventional and like what most people do.

Putting RAM in a core was new to Chuck and the design has evolved
over the years. There are obvious advantages to having more memory
on-chip. One variation would be to have core sized chunks of
on-chip memory to allow two basic building blocks, core and ram.
In the future there may be chips with larger on-chip memory
or more multiply hardware or more sophisticated and higher
speed I/O circuits. But who knows.

The cheapest cell phone a couple of years ago was $18 retail. The price

I was talking of quoted manufacturer cost a couple of years ago, is the
one you are quoting call only, or subsidised?

$18 retail is the unsubsidised cost, the whole thing, no hidden
costs in a bundled usage plan. The processor cost was around $1
so there wasn't a large margin in reducing processor cost. The things
are incredibly complex for an $18 appliance and unless you own
the plants that make the components and are dealing with very
large volume you can't compete with that or hope to recover
development costs.

Best Wishes