Re: How does this make you feel?



John Mashey wrote:
> Steve wrote:
>
> > True enough. But, I don't believe that I have necessarily invented
> > something new, but rather identified a new wrinkle on an old problem.
> > Inasmuch as what I have described collides with previous ideas, I have
> > thought that my particular approach to making general purpose registers
> > more complex, and then applying that complexity to enhance the
> > functionality of the general instruction set, makes for something of an
> > innovation. Expanding that to VM and page table management would be
> > icing on the cake, as it were.
> >
> > Whether there is any real end-user utility to being able to issue an
> > XOR instruction that applies to a 1M range of VM; or whether being
> > able to add a large column of integers, perhaps in steps; or whether
> > being able to do large mmoves makes practical sense as a single
> > instruction, is something I cannot analyse with sufficient rigor at my
> > current level of knowledge. It _seems_ to me that the semantics of a
> > large fraction of assembler instructions could be expanded to take
> > advantage of a more complex address decoder, but I have not examined
> > this in detail. Yet. But I should.
>
> Yes, but you really need to go study a bunch more, else you are wasting
> your time. [This is not to discourage anyone from having new ideas,
> it's simply an observation that there is a minimal level of
> hardware+software knowledge needed before proposing ISA extensions is
> more than a waste of time.]

Obviously I am not really familiar with all of the relevant issues in
this discussion as I don't think about this stuff every day. So bear
with me here as I address your comments as best I can.

> It's not a question of colliding with previous ideas, it's that:
> a) An amazing number of different things have been tried over the
> years, which is why it's important to know the history. Things
> somewhat like this were done 30-40 years ago, in extremely popular
> computers ... and have since disappeared, for good reasons.

The computing world was different then. Resources were constrained in
ways that are no longer relevant. Today, memory bandwidth and I/O
bandwidth are the biggest bottlenecks for many applications that do
not spend most of their time waiting on user input and that are not
also working with very large data sets. CPU designers had constraints
then that are less important now because of Moore's law.
But I certainly agree that it is useful to avoid repeating the
mistakes people made previously.

> b) Of the various combinations of ISA x cache design x MMU design x
> memory system x systems design x OS x languages, some work and some
> don't.
>
> c) The commercial systems of the 1950s and 1960s often supplied various
> memory-to-memory variable-length operations. This lives on in the IBM
> S/360 (circa 1967) and its descendants:

... which is now apparently z/OS. Can't say I know what they're doing
with the zSeries, but it's bound to be interesting.

> SS instructions: 2 memory addresses, and a length (1-256-bytes). These
> are on arbitrary byte boundaries, and hence pretty useful for COBOL,
> PL/I, etc. The original operations included:
> NC AND character
> CLC Compare Logical Character
> MVC Move Character
> OC OR Character
> XC EXclusive OR Character
>
> (There are a bunch more including Translate (and Test), and
> decimal-string operators with 2 lengths.)
> You can use an EXECUTE instruction, with such instructions as targets
> to supply a dynamically-computed length field at run-time, i.e.:
> EX R1,move low-order byte of R1 is OR'd into a copy of "move"
> ....
> move: MVC 0(0,R2),0(R3): copies some number of bytes from 0(R3) to
> 0(R2)
>
> d) S/370 (circa 1970) added some more, including (essentially) the
> "bcopy" or "memcpy" instruction MVCL, which is very close to what
> you've suggested, but actually works usefully:

If by 'actually works usefully' you mean that it is implemented in a
functioning system, then I agree.

> MVCL: Move Long:
> MVCL Ra, Rb: Ra and Rb each specify a register-pair, where the first
> register gives a memory address, and the second gives a byte-count (up
> to 16MB). This copies the data from 0(Rb) to 0(Ra), and if the second
> length is shorter, it uses the high-order byte of [Rb+1] to pad. This
> allows a nice zero-fill: just set [Rb+1] = 0.

Ok

> This is carefully designed to allow interrupts to happen, because no
> one is willing for interrupts to be blocked while 16MB of memory is
> zeroed/copied. That requires updating all 4 registers (adding to the
> addresses, subtracting from the lengths), so that the instruction can
> be restarted correctly.

Fine; you are describing a memory to memory copy operation implemented
in a CPU that has one execution pathway. I understand that you cannot
usually afford to accumulate pending interrupt signals while waiting
for some long operation to complete, but hardware CPU threads would
seem to obviate this concern. So long as a hypothetical MVCL
instruction does not take over the CPU core to such an extent that it
blocks the normal execution of other hardware threads then servicing
interrupts is unaffected. And of course, nobody in their right mind
would even dream of doing a large copy in the top-half of their ISR.

I don't see that this is a necessarily valid concern today.
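
For reference, the restart behaviour John describes can be modelled in a
few lines of C. This is only a sketch of the idea as I read his
description, not the actual S/370 definition: the struct, the field
names and the `budget` parameter (standing in for "an interrupt arrived")
are all my own inventions, and it ignores overlapping operands.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the state MVCL keeps in registers.
 * All names here are illustrative, not taken from any manual. */
struct mvcl_regs {
    unsigned char *dst;        /* Ra:   destination address        */
    size_t dst_len;            /* Ra+1: bytes still to be stored   */
    const unsigned char *src;  /* Rb:   source address             */
    size_t src_len;            /* Rb+1: bytes still to be fetched  */
    unsigned char pad;         /* pad byte used once src runs out  */
};

/* Move at most `budget` bytes, then return so that an interrupt could
 * be serviced. All progress lives in *r, so calling the function again
 * simply resumes the copy. Returns 1 when finished, 0 otherwise. */
int mvcl_step(struct mvcl_regs *r, size_t budget)
{
    assert(r != NULL);
    while (budget > 0 && r->dst_len > 0) {
        if (r->src_len > 0) {
            *r->dst = *r->src;   /* ordinary copy */
            r->src++;
            r->src_len--;
        } else {
            *r->dst = r->pad;    /* source exhausted: pad the rest */
        }
        r->dst++;
        r->dst_len--;
        budget--;
    }
    return r->dst_len == 0;
}
```

Calling `mvcl_step` in a loop until it returns 1 gives the whole copy;
a source length of zero with a zero pad byte gives the zero-fill case.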

> The manual description is a tightly-worded 2 solid pages to cover all
> the cases that can happen.
>
> This has no restrictions of alignment, and barely any of size (16MB),
> and survives exceptions without weird extra state. These instructions
> essentially use register-pairs as byte-string-descriptors, and are
> relatively straightforward to use.

I suppose so. However, I don't suppose it is necessarily all that
hard to set up your registers with a two-instruction sequence:

MOVE ea1, Axm
MOVE ea2, Ax

In practice, for at least the equivalent of one HLL functional block,
succeeding references to the Ax register will use the value of 'Axm'.
I don't see that as being particularly difficult, especially since the
code generation will usually be done by the compiler.

> the compare version is:
> CLCL Ra, Rb compares long strings
> but they left the logical operators out [no NCL, OCL, XCL]
>
> That gives "memcmp" directly.

Ok, what you have described is basically indexed register memory access
with a range. But practically speaking, specifying a change in the
behavior of a register by way of a control register has different
semantic and architectural implications for the whole CPU. Direct
comparisons to existing instructions and addressing modes narrow the
scope of this discussion a little too much.

> e) The DEC VAX (circa 1978), provided a similar, albeit perhaps even
> more baroque set of instructions, but certainly including direct
> equivalents of MVCL (VAX MOVC) and CLCL (VAX CMPC).
>
> f) Hence, the most successful mainframe ISA, and the most successful
> minicomputer ISA both had features that essentially used address:length
> descriptors to do long memory operations.
>
> g) And then, these features (essentially) disappeared from new ISA
> designs, including those for most microprocessors. While there are a
> few memory-to-memory designs done in the last 10 years or so, I don't
> think they are among the really popular designs. The closest popular
> one would be X86's combination of REP + MOVS, but that's not the same
> thing at all.
>
> One might wonder why that happened...

Economics? Among other things, popular CPUs have always needed to cost
much less than mainframe CPUs, otherwise the general public would never
have bought as many home and small business computers as they did.
Costs have fallen, however, and what you can buy today for $1000 would
have been less than a wet dream to a programmer of the 1960s.

> + Perhaps the later designers were just dumb.
> - I've known lots of them, and I doubt it.

There's been a *lot* of emphasis on clock-speed improvements in the
world of microcomputers. That's certainly not the only reason why
people may not have looked closely at the idea of widening their
registers in a non-intuitive way, but I suppose there are only so many
people doing CPU architecture design, and a limited number of
architectures; and so there were a limited number of research avenues
that could be explored at any given time. Market forces would have
affected this situation as well: companies were usually expected to
produce a saleable product and could probably only afford to allocate a
small fraction of their expertise to pure research projects that
wouldn't necessarily produce near-term payoffs.

> + Perhaps the later designers were ignorant of the S/370 and VAX.

I couldn't say. I don't know what they teach these days in computer
science and engineering courses.

> - I suppose that might possibly be true now, but it certainly wasn't
> in the 1970s, 1980s, and early 1990s. Most serious microprocessor ISA
> designers were quite familiar with these, particularly since some of the
> later designers had implemented the earlier ones and were quite familiar
> with them. I.e., the IBM 801 RISC folks certainly knew S/370, and the
> DEC Alpha folks knew VAX. Certainly most people in this field had at
> least studied these ISAs, or more likely, had used either or both of
> these systems for many years.
>
> + Maybe C and UNIX distorted CPU design, especially with RISCs
> - Possible, but as I've posted various times, various RISC CPU
> designers definitely cared about non-C languages and non-UNIX operating
> systems.

Most non-C languages would have been targeted to UNIX systems anyway,
but there's another factor. C is one of the few languages ever in
common use that revealed details of the underlying architecture to the
programmer during normal use. Whatever system you might be using to
write applications in, say, Ruby, you aren't going to need to worry about
the CPU architecture. Plus, existing ISAs (as you put them) work well
enough for most people and most applications, so there are relatively
few people who might have cause to complain, or to think about the
issues.

> + Maybe later designers' insistence on measuring performance impacts
> versus implementation costs caused them to ignore potentially-wonderful
> features whose only problem was that they needed a new OS and new
> language to make use of them.
> - Always possible. Personally, I'd be delighted to see a brand-new
> ISA + OS + language combination that gave real breakthroughs. However,
> the track record for such things has rarely been good, although I still
> admire the thought behind the Burroughs B5000 ... but that was a long
> time ago.

> OK, so why?

Good question.

> - It is no accident that it takes 2 pages to describe MVCL.
>
> - It is a common mistake to count instructions executed, rather than
> cycles consumed. More than one complex design has had powerful
> instructions that were outperformed by sequences of simpler ones: S/360
> MVC was sometimes beaten by Load Multiple/Store Multiple sequences.

The approach I am advocating here does happen to have the property of
reducing code-segment memory accesses, and reducing cache use. Memory
bandwidth is an issue for non-IO bound tasks, and so reducing off-chip
memory accesses for code can only speed things up. Perhaps in the past
memory bus speeds were not so far out of sync with processor clock
speeds as they are today.

> - I've posted numerous times about the care needed to do
> memory->register or register->memory operations when the addresses can
> cross cache-line or page boundaries. They're rife with special cases,
> implementation bugs, and extra cost ... that designers are loath to
> pay, because they cost space and sometimes gate delays, and in
> practice, don't seem to yield proportionate extra performance. This is
> not to say there might not be a role for these, just that they don't
> seem to mesh very well with the main lines of CPU design.

You must know that you can get around non-aligned memory accesses by
expanding the offending instruction into a sequence of aligned
accesses. I would imagine that this might happen in microcode in some
designs, and is probably a really annoying thing to deal with. But
this is a general problem that all CPUs must handle unless they
expressly forbid non-aligned accesses.
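
To show what I mean by expanding one access into aligned ones, here is a
minimal C sketch. It assumes a memory that can only be read one aligned
32-bit word at a time and a little-endian byte order within words; the
function name and the word-array model of memory are my own assumptions,
not a description of how any particular CPU does it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit value starting at byte address `addr` from a memory
 * that only supports aligned word reads (modelled as `mem[]`).
 * Byte 0 of each word is taken to be its low-order 8 bits.
 * The caller must make sure mem[addr / 4 + 1] exists when `addr`
 * is not a multiple of 4. */
uint32_t load32_unaligned(const uint32_t *mem, size_t addr)
{
    assert(mem != NULL);
    size_t word = addr / 4;                     /* first aligned word  */
    unsigned shift = (unsigned)(addr % 4) * 8;  /* misalignment, bits  */
    uint32_t lo = mem[word];
    if (shift == 0)
        return lo;                              /* one access suffices */
    uint32_t hi = mem[word + 1];                /* second aligned read */
    return (lo >> shift) | (hi << (32 - shift));
}
```

The misaligned case costs a second read plus two shifts and an OR, which
is the sort of extra work that ends up in microcode or in a trap handler.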

Back to memcpy... Non-aligned accesses, as well as non-aligned copies,
are probably really annoying, but the problem must already be solved.
Every PCI or AGP videocard on the market probably has a general purpose
hardware engine for arbitrary copies. Hooking into such an engine for
compares or logical operations shouldn't be impossible. If such
devices are not commonly found on CPUs because their instruction sets
don't indicate them, that would seem to be an artifact of the 'main
lines of CPU design' to date.

> - The implementation issues for memory-memory are even worse, at least
> if added onto a typical CPU core. Both S/360 and VAX were designed for
> microcoded implementations, and the extra cost might not be too bad,
> although it was notable that some of the most cost-effective
> implementations [360/44, DEC microVAX] didn't implement all of the
> variable-length instructions.

I have never used DEC machines, so I cannot really comment on
this.

> - Most recent CPU ISAs are designed to allow cost-effective
> implementations of pipelining and usually multiple-issue. As I've
> noted before, complex memory-addressing is one of the most
> problematical features to have in high-speed implementations. [My
> usual example is the claim that the extra address modifiers added going
> from MC 68010 -> 68020 were a mistake in this regard ... and they did
> disappear in the later ColdFire derivatives.]
>
> In most designs, it may be possible to pipeline the simpler operations,
> but usually the complex multiple-memory operand things end up taking
> over all the crucial machine resources, and stop the pipeline in its
> tracks. They also tend to serialize operations to keep complexity
> down. *Sometimes*, with enough work on the design, the hardware can
> indeed do better if it knows an entire address+length in one fell swoop.
> For example, in a uniprocessor with write-back caches, something like a
> MVCL can avoid fetching a cache line that is about to be completely
> overwritten.

Well, there's nothing stopping the programmer from informing the CPU
about memory access patterns, but it is rarely done.

And as for an architecture that supports some number of hardware
instruction threads, the memory bus arbitration can be fairly
complicated. You might even imagine a small bus arbitration engine
that could be configured with different policies via microcode. That
way, the OS could configure the system to limit the burst length to a
sane value for each hardware thread, thus alleviating some concerns
about stalling. I suppose the actual instruction mix encountered in
the real world varies greatly, though, and that it is difficult to
anticipate every possible instruction mix at the design phase.

As for the cache, well, another poster mentioned the WH64 instruction
from the Alpha. I think I've encountered a similar instruction when I read
about the AMD K6-2. If the compiler or the programmer can inform the
CPU when it doesn't need to invalidate a cache line (which might be the
case during a memcpy), or alternately inform the CPU that it *should*
prefetch something, then that is where those decisions are best made.
CPUs may be able to do decent branch prediction in some cases, but
anticipating memory access patterns is much harder, I suspect.
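
As an example of the programmer passing such a hint, GCC and Clang
provide `__builtin_prefetch`. The sketch below is only illustrative:
the function name and the look-ahead distance of 16 elements are
arbitrary choices of mine, and whether the hint helps at all depends on
the machine.

```c
#include <assert.h>
#include <stddef.h>

/* Use the compiler's prefetch hint where available, otherwise nothing. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 0)
#else
#define PREFETCH_READ(p) ((void)0)
#endif

/* Sum an array while hinting that data a little further ahead will be
 * read soon. The hint never changes the result, only (possibly) the
 * timing. */
long sum_with_prefetch(const int *a, size_t n)
{
    const size_t ahead = 16;   /* look-ahead in elements; needs tuning */
    long total = 0;
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (i + ahead < n)
            PREFETCH_READ(&a[i + ahead]);
        total += a[i];
    }
    return total;
}
```

For a simple sequential walk like this a hardware prefetcher may well do
the job unaided; explicit hints matter more for irregular patterns.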

> - And finally, in many designs, the path:
> 1) fetch register(s)
> 2) add displacement (or index or shifted index)
> 3) provide address to MMU and cache
>
> is pretty important.

As in, it is a rather common idiom and should be as fast as possible.

> I've heard fierce arguments over features that
> might cost a gate delay or two in the first two steps in this path.
> In some cases, the *only* addressing mode allowed is (register), i.e.,
> no displacement or indexing. AMD29000 did that, for example. Some of
> us (like MIPS) allowed only displacement(base). Others allow a few
> more, and there is legitimate room for different approaches, but this
> is the kind of stuff serious designers worry about.
>
> The *last* thing most such designers would want is a feature, such that

Are you completely sure about that?

> - In order to access just the address in the register (step 1)
>
> - The CPU has to fetch another register (the "Axm")
>
> - The CPU has to do a *variable* shift of the address register
> depending on the value of the Axm just fetched. [A fixed shift is easy
> and cheap, which is why some ISAs do shifted-index, say of 1, 2 or 3
> bits.] A large arbitrary variable shift is not so cheap.

I suppose your complaint here is that the effective width of each
register is possibly doubled, and that there's more glue than you would
expect attached to each one. A crossbar attached to each register
and a serialized access process would add a lot of complexity and delay to
each register access, but would not increase the physical width of the
register. Without a crossbar, I guess the solution might be to shadow
the register and modifier to a more usable result on store.

> - And if that's not bad enough, the net effect is that *almost any*
> instruction seems like it could turn into a multi-memory-operation, and
> this is only discoverable in the address generation stage, not in
> instruction decode. This is really bad news in aggressive pipeline
> designs, such as speculative out-of-order ones, as it multiplies the
> resources needed to track instructions, and complexifies the load/store
> units. From seeing what went on with S/360/VAX designs, I think this
> feature would make decent pipelining expensive in the way that most
> irritates CPU designers, i.e., that there is a lot of complexity needed
> to cope with rare cases.

Um. I think the CPU would know, from the content of the register
descriptor, when the opcode arguments are resolved during instruction
decode. The set-up to handle a complex address should occur in-line
with this stage of the instruction execution, as would syntactic and
semantic validation. Yes, I guess this would irritate the guy who has
to design the instruction pipeline, but unless I completely miss the mark,
pipelining is a response to on-chip wavefront propagation speeds vs.
clock-cycle frequencies. In a design that relies more heavily on CPU
threading, pipelining could be much less important. Of course, I
don't really know what the timing numbers are like when you measure the
factors related to pipeline economics, and so I could be completely
full of crap here.

> - Aggressive current CPUs have deep load/store queues of address and
> data that are "in flight", and need to make sure the right things
> happen in all the cases. Multiple memory operations don't help this
> any, and may make it a lot worse.

Making a ranged instruction's operation atomic would simplify coherency
problems, i.e. you only stall another in-flight instruction if its
memory accesses collide with an instruction running in another context.
This raises the potential for deadlocks and delays in interrupt
handlers, but might not otherwise pose a serious risk...

> - Finally, it is unclear that this feature helps much, at least if
> added to typical current designs. As far as I can tell, the address
> starts on a power-of-2 boundary, which means it's not directly useable
> for memcpy. With the possible exception of
> writeback-cache-optimization, it's hard to see how this is much faster
> than the straightforward instructions in current CPUs, where people
> have worked very hard to optimize loads and stores and overlap them.

memcpy is a special case. This point can be moot because the compiler
and the operating system's memory allocator will arrange things so that
data structures are usually aligned. Image processing tasks, say, are
not always helped by this, but that's life. But what if the CPU
borrows an engine from the co-processor world? The whole issue
becomes largely moot, I think.
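
To illustrate the alignment point John raises: a byte range that starts
at an arbitrary address can be split into an unaligned head, a run of
whole power-of-2 blocks, and a tail, and only the middle part suits an
engine that insists on power-of-2 boundaries. A minimal C sketch, where
`struct split`, `split_range` and the `block` parameter are names made
up for illustration and not part of any real ISA or of the register
scheme under discussion:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte counts for the three pieces of a range. */
struct split {
    size_t head;  /* bytes before the first block boundary   */
    size_t body;  /* bytes covered by whole, aligned blocks  */
    size_t tail;  /* leftover bytes after the last full block */
};

/* Split [addr, addr + len) around boundaries of size `block`,
 * which must be a power of two. */
struct split split_range(uintptr_t addr, size_t len, size_t block)
{
    struct split s = {0, 0, 0};
    assert(block != 0 && (block & (block - 1)) == 0);

    size_t mis = (size_t)(addr & (block - 1));  /* offset into a block */
    s.head = mis ? block - mis : 0;
    if (s.head > len)
        s.head = len;          /* range ends before the next boundary */

    size_t rest = len - s.head;
    s.body = rest & ~(block - 1);               /* whole blocks only  */
    s.tail = rest - s.body;
    return s;
}
```

For example, 100 bytes starting at address 0x1003 with 16-byte blocks
split into a 13-byte head, an 80-byte body and a 7-byte tail, so the
head and tail still need ordinary byte-wise handling.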

> - Also, of course, the feature, as described, is non-interruptable.
> MVCL and its siblings were interruptable for good reason...

Dealt with above, hopefully with sufficient detail.

> So, like I said, study some more. A while ago, in the WIZ thread, I
> posted suggestions for digital design knowledge desirable for software
> people to participate meaningfully in this turf:
> http://groups-beta.google.com/group/comp.arch/browse_frm/thread/a060bc84cdc66f60?scoring=d&q=wiz+mashey+&hl=en
>
> and there are related discussions nearby in that thread.

I'll certainly have a look at those discussions, but I thought I would
clarify my position first. Hopefully I have not completely
misunderstood your objections and have described a plausible scenario
for the kind of registers I suggest. Perhaps there is no way an
existing design could be easily adapted; that is not my concern. If it
is entirely impractical to design registers this way, and the
memory access paradigm implied is too unwieldy for a gain that is too
small, then I will find out when I learn more about the field. But it
does not seem out of the question given what I know.

Everything depends on the logic and architecture specifics that are
required to support complex addressing schemes for arbitrary opcodes.
Without sitting down and really hammering out the details of an
instruction set and the specifics of its addressing modes; and without
proposing a specific on-chip arrangement of resources, it is going to
be difficult to say anything conclusive about my hypothetical approach.
As a designer, you apparently have some serious objections off the top
of your head. I respect your expertise, but what you have written does
not seem to present any show-stoppers. Perhaps a more detailed study
would show otherwise, and perhaps I don't know enough about CPU design
to comment meaningfully. But I don't think I'm totally off the mark.



Regards,

Steve
