Re: 16-Node Parallel System



Folks,

Opterons will soon be dual core only (mid next year, I believe).
So you're best off going there now with the 4-core configuration and 16 GB of RAM.


Jon


"Randy" <joe@xxxxxxxxxxxxxxx> wrote in message
news:dk89s2$7gt$1@xxxxxxxxxxxxxxx
> A. A. wrote:
> > "Randy" <joe@xxxxxxxxxxxxxxx> wrote in message news:dk5lu5$1k4
> >
> >
> >>In practice, the interconnect of a 16-CPU SMP will not
> >>match the sum of the bandwidths of the 8 interconnects
> >>within 8 dual-SMP nodes -- it just costs too much. You will pay a lot
> >>more money for the 16-SMP's single bus, and in general, its bus will
> >>move perhaps half as much data per second as the 8 dual nodes' 8 buses
> >>would. That means you won't see a linear speedup of a program on the
> >>SMP, even if you use an embarrassingly parallel code in which the
> >>processes are fully independent (e.g. no cache line interference like
> >>trashing or thrashing).
> >
> >
> > Randy, thank you for your elaborate and informative follow-up.
> >
> > I don't think that a 16-CPU interconnect is an option for me, as
> > the maximum number of Opterons that can be used on a single board
> > is 8. What I meant is 16 MPI nodes. So my options are as follows:
> >
> > A. 1 Eight-way mb with 8 dual-core Opteron 870.
> > B. 2 Four-way mb with 4 dual-core Opteron 870 on each mb. The
> > two mb's are connected with GbE cross-over cable (no switch needed).
> > C. 4 Two-way mb with 2 dual-core Opteron 270 on each mb.
> > Connect all mb's with GbE switch.
> > D. 8 Single-processor mb with 1 dual-core Opteron 170 on each.
> > Connect all mb's with GbE switch.
> > E. 16 Single-processor mb with 1 single-core Opteron 146 on each.
> > Connect all mb's with GbE switch.
> >
> > Most of my computation involves parallel matrix calculations and
> > data exchange. So I expect more memory access and processor-to-
> > processor data transfer than hard drive access. However, I must
> > use MPI as opposed to OpenMP or threads because the numerical
> > libraries I need are MPI-based.
> >
> > If price is not the major factor in deciding (since most of the
> > above configuration options only differ by about $2k), which
> > of the above settings would you think is the most efficient--
> > i.e. will give higher speedups and scales better for up to 16
> > MPI nodes (not necessarily 16 processors)?
> >
> >
> > -ammar
>
> Ammar,
>
> Message latencies across cluster interconnects have been improving
> rapidly of late, and I'm not sufficiently up-to-date on the market to
> say exactly which architectural combination of shared and distributed
> memory will run your code fastest, nor which will give you the best
> performance per dollar. I very strongly suspect that nobody can answer
> that reliably without benchmarking your code on the candidate systems.
>
> That said, several options will clearly outperform others.
>
> Dual core CPUs, as well as dual CPU nodes, must share main memory and
> sometimes L2 caches. If your code does NOT make good reuse of caches,
> then you'll easily get the best memory performance from single CPU
> nodes. Dual CPUs will cut each CPU's RAM bandwidth in half (unless the
> node splits the memory into halves which are local to each CPU), and
> dual dual-core CPUs will cut each core's share to 1/4. Regardless, each
> core of a dual core CPU ALWAYS has 1/2 the main RAM bandwidth of a
> single core CPU. But because they
> do not share L1/L2 caches, cache friendly code will not suffer as much
> performance loss (since most memory accesses will be within each core's
> unshared L1 or L2 cache). But the precise amount of cache friendliness
> will depend on your code, especially your matrix library. I strongly
> recommend that you look at getting a matrix library from AMD or Intel if
> you do a lot of matrix math. Building ScaLAPACK/PETSc (or even LAPACK)
> on top of a tuned math/matrix library can make a big performance
> difference.
>
> In short, if RAM bandwidth is important, especially if your code is not
> cache friendly, your program may run significantly faster on a single
> core CPU. I would add another entry to your list of candidate
> architectures: a dual CPU motherboard (with single cores). These have
> been around a long time, and should be priced low (as commodities) by now.
>
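
A quick sketch of the sharing arithmetic in the two paragraphs above, in Python. It assumes the simplest possible model, where every core sharing a memory bus gets an equal slice of it. The function name and the configuration list are made up for illustration, and real numbers would have to come from benchmarking the actual code.

```python
from fractions import Fraction


def per_core_memory_share(sockets, cores_per_socket, memory_local_to_socket):
    """Fraction of one memory bus's bandwidth that each core gets.

    Assumption (not a measurement): all cores sharing a bus split it evenly.
    If each socket has its own local memory, only the cores on that socket
    share; otherwise every core in the node shares one bus.
    """
    if memory_local_to_socket:
        sharers = cores_per_socket
    else:
        sharers = sockets * cores_per_socket
    return Fraction(1, sharers)


# (description, sockets per node, cores per socket) -- illustrative only
CONFIGS = [
    ("single CPU, single core", 1, 1),
    ("single CPU, dual core", 1, 2),
    ("dual CPU, single core", 2, 1),
    ("dual CPU, dual core", 2, 2),
]

for name, sockets, cores in CONFIGS:
    shared = per_core_memory_share(sockets, cores, memory_local_to_socket=False)
    local = per_core_memory_share(sockets, cores, memory_local_to_socket=True)
    print(f"{name:24s} one shared bus: {shared}   memory local to CPU: {local}")
```

Under this model a dual CPU, dual core node with one shared bus leaves each core 1/4 of the bandwidth, while local memory per CPU brings that back to 1/2. Cache-friendly code will feel this much less, as noted in the quoted text.
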
> Likewise, a higher ratio of network boards per CPU core (per cluster
> node) will provide more interconnect bandwidth between nodes. If you
> buy a node with only one network card and 4 cores, that card had better be
> Infiniband (IB), at least.
>
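
The same kind of back-of-the-envelope count works for the network cards. The sketch below applies it to options A through E from the quoted list, assuming one network card per motherboard (the list mentions a GbE connection per board but does not say how many cards) and an even split of that card among the cores on the board. The names are hypothetical.

```python
from fractions import Fraction

# (label, motherboards, cores per motherboard), read off options A-E above
OPTIONS = [
    ("A", 1, 16),
    ("B", 2, 8),
    ("C", 4, 4),
    ("D", 8, 2),
    ("E", 16, 1),
]


def link_share_per_core(cores_per_board, cards_per_board=1):
    """Fraction of one network card available to each core on a board.

    Assumption: the cores on a board split its cards evenly.
    """
    return Fraction(cards_per_board, cores_per_board)


for label, boards, cores in OPTIONS:
    if boards == 1:
        # Option A is a single board, so MPI traffic never leaves the node.
        print(f"{label}: 1 board x {cores} cores, no inter-node traffic")
        continue
    share = link_share_per_core(cores)
    print(f"{label}: {boards} boards x {cores} cores, "
          f"each core gets {share} of a card")
```

This only counts cards per core. It says nothing about latency or about how much of the traffic stays on the board, which is where benchmarking comes in.
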
> If I were you, I would price out some low end Infiniband network boards.
> From what I hear, their price has dropped into the GigE territory in
> the past year, with better bandwidth and a lot better latency. However,
> I don't know if IB can be used effectively without a switch, which would
> obviously add significant cost. But without a switch, GigE too will
> suffer under heavy traffic (and packet collisions). All things being
> equal, retransmits across IB will be a lot faster than across GigE.
>
> To run matrix code, my preferred choice might be: 8 dual CPU
> motherboards, with single cores, interconnected with Infiniband. It's a
> tried and true formula, available from any vendor. And by now, with
> many such systems already out there, it's probably well enough
> documented on the web (somewhere) that you could build your own cluster
> and tune it very effectively, all by yourself.
>
> Randy
>
> --
> Randy Crawford http://www.ruf.rice.edu/~rand rand AT rice DOT edu

