Re: Can anyone explain why this is happening?
- From: Randy <joe@xxxxxxxxxxxxxxx>
- Date: Thu, 20 Oct 2005 21:05:23 -0500
syraero@xxxxxxxxx wrote:
> I am using 16 CPU-local clusters to run a parallelized Computational
> Fluid Dynamics code.
>
> In the middle of calculation, I frequently overwrite a restart file so
> that this restart file can be read into all CPUs simultaneously.
>
> The problem is that, sometimes, it looks like one or two nodes read
> unreasonable values from this restart file.
>
> So.. I rebooted the computers and tried again and it worked fine after
> the reboot.
>
> Can anyone please tell me why this is happening?
>
> Thank you.
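Is it possible that your writes do not include a local buffer flush
before the next read? In C, printing a \n (newline) forces a flush only
on a line-buffered stream such as a terminal; output to a regular file is
normally fully buffered, so you need fflush(3) to push the data out of
the stdio buffer, and fsync(2) or sync(2) to push it from the kernel's
cache onto disk. In many fortrans, you must call a flush routine (or use
the Fortran 2003 FLUSH statement) to expel the data. Of course, if the
writing process closes the file, that should also force a flush of its
own buffers.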
It's possible that the shared file's data integrity sometimes is being
compromised when a process' recently written data has not yet been
flushed to disk, but another process reads the file before the write is
flushed. If a remote process prints, but does not flush the data from
its unix/kernel I/O buffer to NFS, then NFS remains unaware of the new
data and another process may not see the update.
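To make the buffering effect concrete, here is a minimal sketch in C. It assumes a POSIX system and a local file system; the function names visible_size and buffered_write_demo are made up for illustration. It writes one short line, newline included, to a fully buffered file and reports the file size that any other process would see before and after fflush(3).

```c
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: size in bytes that the file system currently reports
 * for `path`, or -1 if stat() fails. This is what another process would see. */
long visible_size(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_size;
}

/* Write one short line to `path` through a fully buffered stdio stream and
 * record the visible file size before and after fflush().
 * Returns 0 on success, -1 on error. */
int buffered_write_demo(const char *path, long *before, long *after)
{
    FILE *fp = fopen(path, "w");            /* truncates the file to length 0 */
    if (fp == NULL)
        return -1;

    /* Regular files are normally fully buffered already; this makes it explicit. */
    if (setvbuf(fp, NULL, _IOFBF, BUFSIZ) != 0) {
        fclose(fp);
        return -1;
    }

    /* The trailing newline does NOT flush a fully buffered stream. */
    if (fputs("1.0 2.0 3.0\n", fp) == EOF) {
        fclose(fp);
        return -1;
    }
    *before = visible_size(path);           /* data still sits in the stdio buffer */

    if (fflush(fp) != 0) {                  /* hand the data to the kernel */
        fclose(fp);
        return -1;
    }
    *after = visible_size(path);            /* now the 12 bytes are visible */

    return fclose(fp) == 0 ? 0 : -1;
}
```

On a local file system the first size should be 0 and the second 12. Over NFS the reading side adds its own client-side caching, so a flush on the writer is necessary but may not be sufficient by itself.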
You also should be able to avoid this by:
1) having the writing process call fsync(2) on the file after it writes
and before any other process reads it (to guarantee that newly written
data have been flushed to disk). fsync(2) acts on the caller's own file
descriptor, so it has to be called by the process that did the write.
2) having all processes (or your batch scheduler) call unix's sync(1)
before they read from a new file (which also flushes all kernel-buffered
written data to disk).
3) switching to MPI I/O read and write routines, instead of plain
read/write calls on the NFS file. These are provided by ROMIO and
Open MPI (and by some of the avant garde or commercial versions of MPI).
I *think* MPI I/O's services assure that each process' write buffers are
flushed with every print/write, but I don't know for sure.
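As a rough illustration of option 1, the sketch below shows one way the writing process could push a restart file all the way out before anyone reads it. It assumes a POSIX system, and the names write_restart and read_restart are hypothetical, not taken from the original code. The order matters: fflush(3) empties the stdio buffer into the kernel, fsync(2) then asks the kernel to commit the file, and only after that is the file closed.

```c
#define _POSIX_C_SOURCE 200809L   /* for fileno() and fsync() declarations */
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write `n` doubles to `path`, then flush them out of
 * the stdio buffer (fflush) and out of the kernel cache (fsync) before
 * closing. Returns 0 on success, -1 on any error. */
int write_restart(const char *path, const double *data, size_t n)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;

    int ok = fwrite(data, sizeof data[0], n, fp) == n  /* into stdio buffer */
          && fflush(fp) == 0                           /* stdio -> kernel   */
          && fsync(fileno(fp)) == 0;                   /* kernel -> storage */

    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Hypothetical reader: open the file fresh and read exactly `n` doubles.
 * Returns 0 on success, -1 if the file is missing or too short. */
int read_restart(const char *path, double *data, size_t n)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    size_t got = fread(data, sizeof data[0], n, fp);
    fclose(fp);
    return got == n ? 0 : -1;
}
```

The readers should open the file only after the writer has closed it (for example after a barrier), since NFS clients generally revalidate their cached data when a file is opened. The reader also checks that it got the full record, which catches a file that was read while still being rewritten.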
Randy
--
Randy Crawford http://www.ruf.rice.edu/~rand rand AT rice DOT edu