Re: Parallel application ran more slowly than the sequential one?
- From: Randy <joe@xxxxxxxxxxxxxxx>
- Date: 12 Jan 2006 14:17:11 -0700
Patricia wrote:
> Could you tell me what might make a parallel application run more
> slowly than the same application on a single standalone PC (the
> sequential one)? Both applications do the same job. Their main
> difference is that the parallel one uses a much more complex data
> structure and algorithm. I've checked the hardware information, and my
> system administrator told me the hardware should not be slowing it down.
> Maybe too many messages are being passed between processors, or is it
> something related to memory usage because of the very complex data
> structure?
>
> What is the most likely reason? How can I find out?
>
> Best Regards
>
> Patricia
A few possible reasons for serial --> parallel slowdown:
1) Your parallel algorithm may be less efficient than your serial algorithm.
2) It takes time to create and coordinate parallel processes (remote
process spawning, MPI initialization, data movement over the network,
loading/unloading MPI data structures, sharing files over NFS, process
synchronization, etc).
3) There is a load imbalance among the processes (some processes finish
early and do nothing until the slowest process finishes, then they can
all proceed to the next MPI call). This implies that the workload is
disproportionately distributed among the processes, or that the data
crunched by some processes can't converge as rapidly as the data
belonging to other processes (or can't be searched equally quickly, etc,
etc).
4) There are pathologies in your system: misconfiguration (of MPI, the
network, or compiler flags), process memory swapping to disk, or other
(perhaps stray) background or foreground processes interfering with
your runs. This *may* become evident if you run your tests repeatedly,
or compare serial to parallel runs, but you may have to monitor the
system explicitly to discover such problems.
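To make reason 3 concrete, here is a minimal Python sketch of how load imbalance shows up in the numbers. The function name and the per-process times are made up for illustration; in a real run you would record each process's compute time between two synchronization points yourself.

```python
def imbalance_report(work_seconds):
    """Summarize one synchronized step.

    work_seconds: hypothetical per-process compute times (seconds)
    measured between two synchronization points.
    """
    slowest = max(work_seconds)                  # everyone waits for this one
    mean = sum(work_seconds) / len(work_seconds)
    idle = [slowest - w for w in work_seconds]   # time each process sits idle
    return {
        "step_time": slowest,
        "efficiency": mean / slowest,            # 1.0 means perfectly balanced
        "idle_seconds": idle,
    }

# Illustrative numbers only: one process has four times the work of two others.
report = imbalance_report([4.0, 1.0, 1.0, 2.0])
print(report)
```

With these numbers the step takes as long as the slowest process, and half of the total processor time is spent waiting.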
A good way to assess your parallel program is:
1) Establish a serial baseline:
Compare the runtime of your parallel program (using only one process) to
the runtime of your serial program. The one-process parallel program
should run nearly as fast as the serial program, but it will take
somewhat longer because it executes extra instructions (to support
parallel processes) that the serial program does not. If the
algorithm differs between the serial and parallel versions, this
baseline matters even more: without it, it will be difficult
to compare their initial runtimes (or to accurately assess the parallel
speedup).
BTW, remember to time your program with a wall clock timer like
MPI_Wtime or gettimeofday(2), and NOT with a process (CPU time) timer,
which does not count the time a process spends blocked.
Remember also that code profilers slow down your code by anywhere from
perhaps 5% (for a function-level profiler) to 50% (for a
source-line-level profiler).
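The difference between the two kinds of timer is easy to see in a small Python sketch. This is not MPI code: time.sleep() is only a stand-in for time a process spends blocked (for example on I/O or at a synchronization point), and the helper names are my own.

```python
import time

def timed(fn):
    """Run fn once; return (wall clock seconds, CPU seconds)."""
    wall_start = time.perf_counter()   # wall clock timer
    cpu_start = time.process_time()    # per-process CPU timer
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu

def blocked_for_a_while():
    # Stand-in for a process that is waiting rather than computing.
    time.sleep(0.2)

wall, cpu = timed(blocked_for_a_while)
print(f"wall clock: {wall:.3f} s, CPU time: {cpu:.3f} s")
```

The CPU timer reports almost nothing here even though the caller waited about 0.2 seconds, which is why a process timer can make a communication-bound program look much faster than it is.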
2) Compute the parallel speedup:
Measure the wall clock runtime of your parallel program using increasing
numbers of processes that are appropriate for your task decomposition,
usually 1, 2, 4, 8, 16, etc. If the program scales well, you'll usually
see its runtime at 2 or 4 processes drop below that of the serial
program. Of course, the runtime should shrink further as you add even
more processes.
However, if the program speeds up only a little before the addition of
more processes makes little or no difference, then either your parallel
algorithm is too inefficient, or the fraction of your program that has
been parallelized (that speeds up through parallelism) is too small to
have much impact on the total runtime (as described by Amdahl's Law).
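For reference, Amdahl's Law bounds the speedup on n processes at 1 / ((1 - p) + p/n), where p is the fraction of the runtime that benefits from parallelism. A small Python sketch (the function name and the 50% figure are just illustrative assumptions):

```python
def amdahl_speedup(parallel_fraction, nprocs):
    """Upper bound on speedup when only parallel_fraction of the
    runtime is spread across nprocs processes (Amdahl's Law)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nprocs)

# Illustrative case: only half of the runtime has been parallelized.
for n in (1, 2, 4, 8, 16):
    print(n, round(amdahl_speedup(0.5, n), 2))
```

With half of the runtime still serial, the speedup can never reach 2 no matter how many processes you add, and that is before any communication overhead is counted.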
Remember too, even if the runtime of your parallel version never falls
below that of your serial version, if the parallel version can support
more data (or higher precision) than the serial version, then it still
adds value.
3) Run each test repeatedly:
Compare the shortest runtimes across tests (e.g. the shortest runtime
for 1 process vs. the shortest runtime for 2 processes, etc). This
helps filter out the occasional resource contention or interference
that arises in any nondeterministic shared environment (like a
multiuser O/S).
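Steps 2 and 3 together amount to a small table of best-case runtimes. A minimal Python sketch, assuming you have already collected wall clock timings yourself (the function name and all the numbers below are invented for illustration):

```python
def scaling_table(runs):
    """runs maps process count -> list of wall clock runtimes (seconds).

    Returns rows of (nprocs, best_time, speedup, efficiency), using the
    shortest runtime per process count and the 1-process run as baseline.
    """
    best = {n: min(times) for n, times in runs.items()}
    baseline = best[1]
    table = []
    for n in sorted(best):
        speedup = baseline / best[n]
        table.append((n, best[n], speedup, speedup / n))
    return table

# Invented timings: three repeated runs at each process count.
runs = {
    1: [10.4, 10.0, 10.9],
    2: [6.1, 5.0, 5.2],
    4: [4.0, 4.6, 4.1],
}
for n, t, s, e in scaling_table(runs):
    print(f"{n:2d} procs  best {t:5.1f} s  speedup {s:4.2f}  efficiency {e:4.2f}")
```

Efficiency is speedup divided by process count; watching it fall as processes are added is a quick way to see where scaling stops paying off.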
Randy
--
Randy Crawford http://www.ruf.rice.edu/~rand rand AT rice DOT edu