Re: implementing arbitrary combinational functions using block rams
- From: anand <writeanand@xxxxxxxxx>
- Date: Sun, 10 May 2009 16:57:09 -0700 (PDT)
Thanks to all of you for your responses.
Jacko: You are right - the duplication needs to happen depending on
the number of read ports to the RAM. However, we have a custom flow
that allows time-multiplexed access to N read ports. We usually keep N
around 20; it can go higher, to 32 or even 50, at the expense of
performance, since more clock edges to the BRAM are needed to read all
50 ports. That effectively "stretches" the DUT clock cycle, which means
a lower DUT clock frequency and therefore lower performance.
In other words, our custom flow maps N ports from the RTL to 2 physical
ports in the BRAM through time multiplexing.
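
A rough sketch of the port arithmetic, in Python for illustration only.
This is not the actual flow: the names schedule_ports and
memory_clocks_needed are made up, and the round-robin slot assignment is
an assumption. It only shows how N logical read ports could be spread
over 2 physical ports and how many memory clock edges that costs.

```python
from math import ceil

PHYSICAL_PORTS = 2  # a dual-port BRAM exposes two physical ports


def schedule_ports(n_logical, n_physical=PHYSICAL_PORTS):
    """Assign each logical read port to a (memory_clock_slot, physical_port) pair.

    Assumed policy: plain round-robin, filling both physical ports in a
    slot before moving on to the next memory clock.
    """
    if n_logical < 0 or n_physical < 1:
        raise ValueError("need n_logical >= 0 and n_physical >= 1")
    return [(i // n_physical, i % n_physical) for i in range(n_logical)]


def memory_clocks_needed(n_logical, n_physical=PHYSICAL_PORTS):
    """Minimum number of memory clock edges to serve every logical port once."""
    return ceil(n_logical / n_physical)


if __name__ == "__main__":
    for n in (20, 32, 50):
        print(n, "logical ports ->", memory_clocks_needed(n), "memory clocks")
```

With 2 physical ports, 20 logical ports need 10 memory clocks per DUT
cycle and 50 need 25, which is where the "stretching" comes from.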
whygee: Yes, the PAR effort will be higher. However, when you consider
a multi-FPGA design (I am talking about a 250-FPGA design at a
minimum), the number of BRAMs wasted is tremendous. The LUTs are
usually the bottleneck, not the BRAMs. This is why the more logic we
infer in BRAMs, the better.
JeffC: Thanks for the -bp "map slice logic into unused block rams"
option.
(a) Is it part of Xilinx ISE 11.x? Can you give me some examples?
(b) Does it work well? I see some posts mentioning that it doesn't
work well. What is your experience/suggestion?
I ask this because we have developed a custom flow (as I mentioned)
that does something like this: RTL parsing, synthesis (to Virtex-4
gates), and partitioning (it is a multi-FPGA design, so the goal is to
get optimal communication between FPGAs and to avoid long
combinational paths as much as possible).
The net result is that the user RTL is transformed into multiple EDF
files. On top of this, we add some vendor-specific instrumentation
logic and C code to allow emulator access to certain internal signals,
plus other debug features. THEN it goes through the normal Xilinx
mapping, place and route flow, etc. Finally, we do some custom
post-processing (after the bit files for the design have been
generated) before stimulating the user RTL model with test content.
Peter: Yes, you are right about the clock. However, in the custom flow
I have outlined, the memory operations are transparent to the end
user - i.e. all of the BRAM read/write operations happen in a
time-multiplexed fashion, and multiple R/W accesses in a single cycle
are done using a memory clock that is much faster than the emulator
"tick" phase length (or the DUT frequency, in other words).
In other words, between DUT cycle N and N+1, there are K memory clocks
that allow for multiple BRAM accesses (where K > N). Secondly, the
user does NOT see the memory clock and cannot access it - it is
completely transparent. Because of this, if we increase the number of
ports haphazardly, the DUT clock cycle gets stretched automatically,
thereby reducing performance. But this rarely happens, since the
combinational delays from FPGA to FPGA are far larger than this
"stretching" effect - effectively, the bottleneck is still the long
combinational paths that span several FPGAs.
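
One simple way to picture why the stretching rarely matters, again as a
Python sketch. The model is an assumption, not something taken from the
actual flow: it treats the DUT cycle as the larger of the inter-FPGA
combinational delay and the window needed for the K memory clocks. The
function name and all numbers are hypothetical.

```python
def dut_period_ns(inter_fpga_delay_ns, k_memory_clocks, memory_clock_period_ns):
    """Assumed model: the DUT cycle is set by whichever is longer, the
    inter-FPGA combinational path or the window of K memory clocks."""
    memory_window_ns = k_memory_clocks * memory_clock_period_ns
    return max(inter_fpga_delay_ns, memory_window_ns)


if __name__ == "__main__":
    # Hypothetical numbers: a 5 ns memory clock and a 500 ns inter-FPGA path.
    for k in (10, 25, 200):
        print("K =", k, "-> DUT period", dut_period_ns(500.0, k, 5.0), "ns")
```

Under this model the memory window only stretches the DUT cycle once
K times the memory clock period exceeds the inter-FPGA delay.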
Now, given this scenario, my question is: how can I use a BRAM as a
combinatorial lookup ROM? What are the pros and cons?
In your paper "Creative Uses of Block RAM" (an excellent paper btw, it
piqued my curiosity!) at
http://www.xilinx.com/support/documentation/white_papers/wp335.pdf
you mention that one can implement sine/cosine functions as a lookup
table.
Why not use lookup ROMs for arbitrary combinational logic as well, and
put to work the BRAMs that are lying unused anyway?
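
To make the question concrete, here is a minimal Python sketch of the
idea: tabulate an arbitrary function over all of its input combinations
and use the table as ROM contents, with the inputs as the address. The
names build_rom, fits_in_bram and majority_and_parity are hypothetical.
The aspect ratios are those of the 18 Kb Virtex-4 block RAM. Note that
a BRAM read is clocked, so the result appears after a clock edge rather
than combinationally (Peter's point about the clock).

```python
# (depth, width) configurations of one 18 Kb Virtex-4 block RAM
ASPECTS = [(16384, 1), (8192, 2), (4096, 4), (2048, 9), (1024, 18), (512, 36)]


def build_rom(func, n_inputs, n_outputs):
    """Tabulate func over all 2**n_inputs input words; address = input bits."""
    mask = (1 << n_outputs) - 1
    return [func(addr) & mask for addr in range(1 << n_inputs)]


def fits_in_bram(n_inputs, n_outputs):
    """True if an n_inputs -> n_outputs function fits in a single block RAM."""
    return any((1 << n_inputs) <= depth and n_outputs <= width
               for depth, width in ASPECTS)


def majority_and_parity(x):
    """Example 9-input function: bit 0 = majority, bit 1 = odd parity."""
    ones = bin(x).count("1")
    return int(ones >= 5) | ((ones & 1) << 1)


if __name__ == "__main__":
    rom = build_rom(majority_and_parity, 9, 2)
    print(len(rom), "entries; fits in one BRAM:", fits_in_bram(9, 2))
```

The table doubles with every extra input, so one block holds at most a
14-input, 1-output function or a 9-input, 36-output one; wider
functions would have to be decomposed first.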
Thanks once again; I look forward to a good discussion with all of
you.