Re: 16-Node Parallel System
- From: "McDoug-Village" <brones@xxxxxxxxx>
- Date: Sat, 5 Nov 2005 17:31:40 -0800
Folks,
Opterons will soon be dual core only (mid next year, I believe).
So you're best off going there now with the 4-core configuration and 16 GB RAM.
Jon
"Randy" <joe@xxxxxxxxxxxxxxx> wrote in message
news:dk89s2$7gt$1@xxxxxxxxxxxxxxx
> A. A. wrote:
> > "Randy" <joe@xxxxxxxxxxxxxxx> wrote in message news:dk5lu5$1k4
> >
> >
> >>In practice, the interconnect of a 16-CPU SMP will not match the
> >>combined bandwidth of the 8 interconnects within 8 dual-SMP nodes -- it
> >>just costs too much. You will pay a lot more money for the 16-SMP's
> >>single bus, and in general, its bus will move perhaps half as much data
> >>per second as the 8 dual nodes' 8 buses would. That means you won't see
> >>a linear speedup of a program on the SMP, even if you use an
> >>embarrassingly parallel code in which the processes are fully
> >>independent (e.g. no cache line interference such as thrashing).
> >
> >
> > Randy, thank you for your elaborate and informative follow-up.
> >
> > I don't think that a 16-CPU interconnect is an option for me, as
> > the maximum number of Opterons that can be used on a single board
> > is 8. What I meant is 16 MPI nodes. So my options are as follows:
> >
> > A. 1 Eight-way mb with 8 dual-core Opteron 870.
> > B. 2 Four-way mb with 4 dual-core Opteron 870 on each mb. The
> > two mb's are connected with GbE cross-over cable (no switch needed).
> > C. 4 Two-way mb with 2 dual-core Opteron 270 on each mb.
> > Connect all mb's with GbE switch.
> > D. 8 Single-processor mb with 1 dual-core Opteron 170 on each.
> > Connect all mb's with GbE switch.
> > E. 16 Single-processor mb with 1 single-core Opteron 146 on each.
> > Connect all mb's with GbE switch.
> >
> > Most of my computation involves parallel matrix calculations and
> > data exchange. So I expect more memory access and processor-to-
> > processor data transfer than hard drive access. However, I must
> > use MPI as opposed to OpenMP or threads because the numerical
> > libraries I need are MPI-based.
> >
> > If price is not the major factor in deciding (since most of the
> > above configuration options only differ by about $2k), which
> > of the above settings would you think is the most efficient--
> > i.e. will give higher speedups and scales better for up to 16
> > MPI nodes (not necessarily 16 processors)?
> >
> >
> > -ammar
>
> Ammar,
>
> Message latencies across cluster interconnects have been improving
> rapidly of late, and I'm not sufficiently up-to-date on the market to
> say exactly which architectural combination of shared and distributed
> memory will run your code fastest, nor which will give you the best
> performance per dollar. I very strongly suspect that nobody can answer
> that reliably without benchmarking your code on the candidate systems.
>
> That said, several options will clearly outperform others.
>
> Dual core CPUs, as well as dual CPU nodes, must share main memory and
> sometimes L2 caches. If your code does NOT make good reuse of caches,
> then you'll easily get the best memory performance from single CPU
> nodes. Dual CPUs will cut the RAM bandwidth per CPU in half (unless the
> node splits the memory into halves which are local to each CPU), and
> dual dual-core CPUs will cut it to 1/4 per core. Regardless, each core
> of a dual core CPU ALWAYS has 1/2 the main RAM bandwidth of a single
> core CPU. But because the cores
> do not share L1/L2 caches, cache friendly code will not suffer as much
> performance loss (since most memory accesses will be within each core's
> unshared L1 or L2 cache). But the precise amount of cache friendliness
> will depend on your code, especially your matrix library. I strongly
> recommend that you look at getting a matrix library from AMD or Intel if
> you do a lot of matrix math. Building ScaLAPACK/PETSc (or even LAPACK)
> on top of a tuned math/matrix library can make a big performance
> difference.
>
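The rule of thumb in the paragraph above can be sketched in a few lines of Python. This is only a rough model, assuming bandwidth divides evenly among everything that shares one memory system; the function name and the `memory_local_to_cpu` flag are hypothetical, made up for illustration. Real figures depend on the chipset and on how cache friendly the code is, so treat it as a starting point for benchmarking, not a prediction.

```python
def per_core_bandwidth(node_bandwidth, cpus, cores_per_cpu,
                       memory_local_to_cpu=False):
    """Rough share of main-memory bandwidth seen by each core.

    node_bandwidth is the bandwidth of one memory system, i.e. what a
    single-CPU, single-core node would get (any unit you like).
    Assumption: everything sharing a memory system splits it evenly.
    """
    if cpus < 1 or cores_per_cpu < 1:
        raise ValueError("cpus and cores_per_cpu must be at least 1")
    # Cores on one CPU always share that CPU's path to memory.
    sharers = cores_per_cpu
    # CPUs on one board also share, unless each CPU has its own local memory.
    if not memory_local_to_cpu:
        sharers *= cpus
    return node_bandwidth / sharers


if __name__ == "__main__":
    # Relative bandwidth per core, with a single-core single-CPU node = 1.0
    for label, cpus, cores in [("1 CPU x 1 core", 1, 1),
                               ("1 CPU x 2 cores", 1, 2),
                               ("2 CPUs x 1 core", 2, 1),
                               ("2 CPUs x 2 cores", 2, 2)]:
        shared = per_core_bandwidth(1.0, cpus, cores)
        local = per_core_bandwidth(1.0, cpus, cores, memory_local_to_cpu=True)
        print(f"{label}: shared memory {shared:.2f}, local memory {local:.2f}")
```

Under this model the dual CPU, dual core node with shared memory lands at 1/4 per core, matching the figure quoted above, while per-CPU local memory brings it back to 1/2.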
> In short, if RAM bandwidth is important, especially if your code is not
> cache friendly, your program may run significantly faster on a single
> core CPU. I would add another entry to your list of candidate
> architectures: a dual CPU motherboard (with single cores). These have
> been around a long time, and should be priced low (as commodities) by now.
>
> Likewise, a higher ratio of network boards per CPU core (per cluster
> node) will provide more interconnect bandwidth between nodes. If you
> buy a node with only one network card and 4 cores, that card better be
> Infiniband (IB), at least.
>
> If I were you, I would price out some low end Infiniband network boards.
> From what I hear, their price has dropped into the GigE territory in
> the past year, with better bandwidth and a lot better latency. However,
> I don't know if IB can be used effectively without a switch, which would
> obviously add significant cost. But without a switch, GigE too will
> suffer under heavy traffic (and packet collisions). All things being
> equal, retransmits across IB will be a lot faster than across GigE.
>
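Why latency matters as much as bandwidth can be seen from the usual first-order model, time = latency + size / bandwidth. The Python sketch below uses that model only; it ignores contention, collisions and retransmits. The two link profiles are hypothetical placeholder figures chosen for illustration, not measurements of GigE or Infiniband, and should be replaced with numbers from a ping-pong benchmark on the actual hardware.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order estimate of the time to send one message, in seconds."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return latency_s + message_bytes / bandwidth_bytes_per_s


def speedup(message_bytes, slow_link, fast_link):
    """How many times faster fast_link moves one message than slow_link."""
    return (transfer_time(message_bytes, **slow_link)
            / transfer_time(message_bytes, **fast_link))


# Hypothetical placeholder profiles (NOT measured values): the fast link
# has 10x lower latency and 5x higher bandwidth than the slow one.
SLOW_LINK = {"latency_s": 50e-6, "bandwidth_bytes_per_s": 100e6}
FAST_LINK = {"latency_s": 5e-6, "bandwidth_bytes_per_s": 500e6}

if __name__ == "__main__":
    for size in (8, 1_000, 100_000, 10_000_000):
        ratio = speedup(size, SLOW_LINK, FAST_LINK)
        print(f"{size:>10} bytes: fast link is {ratio:.1f}x quicker")
```

With these made-up profiles, small messages gain roughly the latency ratio and large ones roughly the bandwidth ratio, so a code that exchanges many small messages benefits most from a low-latency interconnect.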
> To run matrix code, my preferred choice might be: 8 dual CPU
> motherboards, with single cores, interconnected with Infiniband. It's a
> tried and true formula, available from any vendor. And by now, with
> many such systems already out there, it's probably well enough
> documented on the web (somewhere) that you could build your own cluster
> and tune it very effectively, all by yourself.
>
> Randy
>
> --
> Randy Crawford http://www.ruf.rice.edu/~rand rand AT rice DOT edu