Re: Future memory modules



JJ wrote:
> Dysthymicdolt@xxxxxxx wrote:
> > RLDRAM2 may be suitable to networking applications, but it
> > seems to be poorly designed for an L3 memory. While the
>
> Of course many networking applications are highly interleaved so the
> latency can be pretty well hidden and issue rate exploited, same with
> some processor designs. RL has fast-large doodad written all over it,

Are you thinking heavily multithreaded designs like Sun's Niagara
(8 4-way SMT scalar SPARC cores)?

> just work with it some. Shame about the banking, it should have been
> several times finer. Now if it had been 64K banks or so, the design
> would effectively look like an 8 stage pipelined SRAM with full 2.5ns
> issue rate.

Of course, that would have reduced density and increased cost
per bit. Anyone care to guess by how much?

> > latency might not be bad for such a potentially large cache
> > (e.g., 256 MiB using 4 chips), the design is not specifically
> > optimized for such usage (e.g., burst lengths of 2 and 4 [and
> > 8 for 9b/18b wide chips] where a single large burst length
> > could have allowed even lower latency [at least the bursts do
>
> RLDRAMs are best used for their interleaved flat memory model with high
> issue rates, not for bursty applications. If you only want bursty,
> DDR3, or RDRAM might be better.

I was thinking that a larger burst length would allow the chip design
to reduce latency further--for an 8b burst length, one would only
need one eighth of the bits to be accessible in say two cycles, the
second eighth in three cycles, etc. It would be difficult to make
half of the bits (2b burst length) sufficiently faster to significantly
reduce latency. For an L3, a smallish 64B block size would allow
for a 64b wide interface and 8b bursts (one might provide two 64b
wide ranks to increase bandwidth--as long as even and odd block
accesses are well distributed).

> > Such a huge cache presents other problems, though. For anything
> > but a high-end server processor in which all configurations
> > include such off-chip cache, the tags would have to be
> > off-chip which would significantly increase latency for a
>
> Tags add another 20% or more to width, but you are stuck in
> conventional cache architecture.

Of course, the additional width required depends on the design.
A direct-mapped cache could get by easily with less than a 12%
increase in width (sharing the extra lines between tag and ECC would
allow reading 16b of tag per half cycle for the first half of an 8b
burst while the second half reads the ECC; an 80b wide interface
would support one 64b tag per 64B cache block) with parallel read of
tags and data. A more
associative cache with sequential tag then data access would
add significant latency even with early way selection based on
partial tag comparison. (One advantage of a relatively large tag
memory could be the ability to include some additional information
such as previously allocated blocks or a next-fetch prediction.)
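
A rough check of the bit budget in the parenthetical above. This is only
a sketch of one reading of it: the 72b ECC-only baseline (8 check bits
per 64 data bits) is an assumption used to reproduce the "less than 12%"
figure, and all the names are made up for illustration.

```python
# Assumed figures: 64 data lines plus 16 extra lines (80b interface),
# burst of 8 transfers, so one burst moves one 64B block.
DATA_BITS = 64
EXTRA_BITS = 16
BEATS = 8

block_bits = DATA_BITS * BEATS        # 512 bits = one 64B block
tag_bits = EXTRA_BITS * (BEATS // 2)  # extra lines carry tag in the first half
ecc_bits = EXTRA_BITS * (BEATS // 2)  # and ECC in the second half

# Assumption: the comparison baseline is a 72b interface (64 data + 8 ECC).
ECC_ONLY_WIDTH = 72
width_increase = (DATA_BITS + EXTRA_BITS) / ECC_ONLY_WIDTH - 1

print(block_bits // 8, tag_bits, ecc_bits)  # 64 64 64
print(f"{width_increase:.1%}")              # 11.1%
```

Under those assumptions the 64b of ECC per block matches the usual 8
check bits per 64 data bits, and the widening relative to an ECC-only
interface comes out a bit over 11%.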

> > ISTM that RLDRAM might be more attractive as a parallel, fast
> > memory (software-managed cache). Unfortunately, OSes and
> > applications are not designed to exploit such.
>
> No we don't really want that.

What I mean is a page-size block, fully associative, non-redundant
(i.e., data swapping not data copying) 'cache' so that TLB entries
are effectively the tags. What is so horrible about that? (Obviously
the allocation policy would have to be reasonably smart since a
4KiB page swap would use 1280ns of the fast memory interface
with a similar utilization of the main memory interface. However, I
would guess that even a simple allocation with eviction only on
page unmapping might be able to boost performance enough to
justify the added cost.)
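
For reference, the 1280ns figure can be reproduced as follows. This is a
sketch that assumes the 64b wide interface and 2.5ns cycle mentioned
earlier in the thread, two transfers per cycle, and no command or
turnaround overhead; the function and parameter names are hypothetical.

```python
def page_swap_ns(page_bytes=4096, bus_bits=64, cycle_ns=2.5,
                 transfers_per_cycle=2):
    """Bus time to swap one page: one page read out plus one written in."""
    bytes_per_cycle = bus_bits // 8 * transfers_per_cycle  # 16 B per cycle
    one_way_cycles = page_bytes / bytes_per_cycle          # 256 cycles
    return 2 * one_way_cycles * cycle_ns

print(page_swap_ns())  # 1280.0
```

Each direction takes 640ns, so a swap ties up the fast memory interface
for 1280ns before any overhead is counted.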

> > (It might be desirable to stripe blocks across 7 of the 8 banks
> > to maximize throughput for sequential accesses and reduce bank
> > conflict [MOD 7 is not that slow to compute]. [This could also
> > reduce conflict misses in a cache.] Of course, there is then
> > the problem of what to do with the eighth bank.)
>
> Nor that.

What is wrong with prime modulo bank striping? Adding, say,
500ps (or less?) to the access latency should not be a big issue.
POWER4/5 use 3-way L2 cache banking, presumably to reduce bank
conflicts (taking the modulus of a 'large number' of the physical
address bits).
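
To illustrate why MOD 7 is cheap: since 8 = 1 (mod 7), the residue can
be found by summing the 3-bit digits of the address, which in hardware
is a small adder tree. The sketch below is software only and says
nothing about how any real memory controller does it; mod7 is a
hypothetical helper.

```python
def mod7(x):
    """Reduce a non-negative integer mod 7 by summing its 3-bit digits."""
    while x > 7:
        digit_sum = 0
        while x:
            digit_sum += x & 7  # low 3-bit digit
            x >>= 3
        x = digit_sum
    return 0 if x == 7 else x   # 7 itself is congruent to 0

# Block addresses with a power-of-two stride of 8.
stride8 = range(0, 64, 8)
print([a % 8 for a in stride8])    # [0, 0, 0, 0, 0, 0, 0, 0]
print([mod7(a) for a in stride8])  # [0, 1, 2, 3, 4, 5, 6, 0]
```

With 8 banks selected by the low address bits, a stride of 8 blocks
hits the same bank every time; with 7 banks selected by MOD 7, the same
stride cycles through all of them.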


Paul A. Clayton
