Re: implementing arbitrary combinational functions using block rams



Thanks to all of you for your responses.

Jacko: You are right - the duplication needs to happen depending on
the number of read ports to the RAM.
However, we have a custom flow that allows time-multiplexed access to
N read ports. We usually keep N around 20; it can go higher, to 32 or
even 50, at the expense of performance, since more clock edges to the
BRAM are needed to read all 50 ports. This effectively "stretches"
the DUT clock cycle, which means a lower DUT clock frequency and
therefore lower performance.

In other words, our custom flow maps N ports from RTL to 2 physical
ports in BRAM through
time multiplexing.
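
To make the port mapping concrete, here is a minimal sketch of one way N logical read ports could be assigned to memory-clock slots on 2 physical ports. The function names and the simple round-robin assignment are assumptions for illustration only, not the actual flow:

```python
from math import ceil

def schedule_ports(num_logical_ports, physical_ports=2):
    """Map each logical port to a (memory_clock_slot, physical_port) pair.

    Hypothetical round-robin assignment: ports 0 and 1 share slot 0,
    ports 2 and 3 share slot 1, and so on.
    """
    return {
        port: divmod(port, physical_ports)
        for port in range(num_logical_ports)
    }

def memory_clocks_needed(num_logical_ports, physical_ports=2):
    """Number of memory-clock slots needed to service every logical port once."""
    return ceil(num_logical_ports / physical_ports)

if __name__ == "__main__":
    for n in (20, 32, 50):
        print(n, "logical ports ->", memory_clocks_needed(n), "memory clocks")
```

Under this assumed scheme, every extra pair of logical ports costs one more memory clock per DUT cycle, which is where the "stretching" comes from.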

whygee: Yes, the PAR effort will be higher. However, when you consider
a multi-FPGA design (I am talking about a 250-FPGA design at a
minimum), the number of BRAMs wasted is tremendous. The LUTs are
usually the bottleneck, not the BRAMs. This is why the more logic we
can infer in BRAMs, the better.

JeffC: Thanks for pointing out the -bp ("map slice logic into unused
block RAMs") option.
(a) Is it part of Xilinx ISE 11.x? Can you give me some examples,
maybe?
(b) Does it work well? I have seen some posts mentioning that it
doesn't work well.
What is your experience/suggestion?

I ask this because we have developed a custom flow (as I mentioned)
that does something like this:
RTL parsing, synthesis (to Virtex-4 gates), and partitioning (it is a
multi-FPGA design, so the aim is to get optimal communication between
FPGAs and to avoid long combinational paths as much as possible).
The net result is that the user RTL is transformed into multiple EDF
files. On top of this, we add some vendor-specific instrumentation
logic and C code to allow emulator access to certain internal signals,
plus other debug features. THEN it goes through the normal Xilinx
map, place and route flow. Finally, we do some custom post-processing
(after the bit files for the design have been generated) before
stimulating the user RTL model with test content.

Peter: Yes, you are right about the clock. However, in the custom
flow I have outlined, the memory operations are transparent to the
end user, i.e. all of the BRAM read/write operations happen in a
time-multiplexed fashion, and multiple R/W accesses in a single DUT
cycle are done using a memory clock that is much faster than the
emulator "tick" phase length (or DUT frequency, in other words).
In other words, between DUT cycle N and N+1, there are K memory
clocks that allow for multiple BRAM accesses (where K > N). Secondly,
the user does NOT see the memory clock and cannot access it - it is
completely transparent. Because of this artefact, if we increase the
number of ports haphazardly, the DUT clock cycle gets stretched
automatically, thereby reducing performance.
But this rarely hurts in practice, since the combinational delays from
FPGA to FPGA are far larger than this "stretching" effect;
effectively, the bottleneck is still the long combinational paths
that span several FPGAs.
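
As a rough illustration of that trade-off, the sketch below estimates the DUT period as the larger of the memory phase and the worst inter-FPGA combinational delay. The function name, the 2-port packing, and all timing numbers are made-up assumptions, and the model assumes the two effects overlap rather than add up:

```python
from math import ceil

def dut_period_ns(num_ports, mem_clk_ns, inter_fpga_delay_ns, physical_ports=2):
    """Estimate the DUT clock period under a simplified, hypothetical model.

    Assumption: the memory phase (all time-multiplexed port accesses) and
    the inter-FPGA combinational paths run in parallel, so the longer of
    the two sets the period.
    """
    mem_phase_ns = ceil(num_ports / physical_ports) * mem_clk_ns
    return max(mem_phase_ns, inter_fpga_delay_ns)

if __name__ == "__main__":
    # Illustrative numbers only: 5 ns memory clock, 400 ns worst inter-FPGA path.
    for ports in (20, 50, 200):
        print(ports, "ports ->", dut_period_ns(ports, 5.0, 400.0), "ns")
```

With these example numbers, 20 or 50 ports leave the period at the inter-FPGA limit, and the memory phase only becomes the bottleneck once the port count grows well beyond that.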

Now, given this scenario, my question is: how can I use the BRAM as a
combinatorial lookup ROM?
What are the pros and cons?

In your paper "Creative Uses of Block RAM" (an excellent paper btw,
it piqued my curiosity!)
http://www.xilinx.com/support/documentation/white_papers/wp335.pdf
you mention that one can implement sine/cosine functions as a lookup
table.
Why not use lookup ROMs for arbitrary combinational logic as well,
and use BRAMs that are going to waste anyway?
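
To show what I mean by a lookup ROM for arbitrary logic, here is a minimal sketch that tabulates a combinational function over all of its input combinations, so that the packed inputs become the ROM address and the stored word is the output. The helper names and the 4-bit adder used as the example function are hypothetical choices for illustration:

```python
def build_rom(func, num_inputs):
    """Tabulate func over every input combination; the index is the ROM address."""
    return [func(addr) for addr in range(1 << num_inputs)]

def adder4(addr):
    """Example logic: 4-bit a + 4-bit b + carry-in, packed into a 9-bit address.

    Address layout (an arbitrary choice): bits 3..0 = a, bits 7..4 = b, bit 8 = cin.
    """
    a = addr & 0xF
    b = (addr >> 4) & 0xF
    cin = (addr >> 8) & 0x1
    return a + b + cin  # fits in 5 bits: 4-bit sum plus carry-out

rom = build_rom(adder4, 9)  # 512 words of 5 bits each

if __name__ == "__main__":
    addr = (1 << 8) | (7 << 4) | 9  # cin=1, b=7, a=9
    print(len(rom), "entries; rom[addr] =", rom[addr])
```

The table doubles in size with every extra input bit, so this only pays off for functions with a modest number of inputs. Also, as far as I know, block RAM reads are clocked, so the "combinational" result would only appear after a memory clock edge.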


Thanks once again; I look forward to a good discussion with all of
you.