[torqueusers] parallel jobs - not so much

Troy Baer tbaer at utk.edu
Fri Nov 1 13:52:31 MDT 2013


On Fri, 2013-11-01 at 15:42 -0400, Gus Correa wrote:
> 2. I am not sure I understood correctly, but it looks to me
> like your $TMPDIR is on a local disk on the compute nodes, right?
> 
> Staging executables and data in and out of local disk
> is possible (but can be painful).
> If you want to do this for parallel jobs, you need to copy
> the executable and data into $TMPDIR on *every* node
> participating in that particular job.
> This is probably why your 2-node job fails.
> It is also why staging in is painful (copying to all nodes).

Agreed.  If you need to stage files to and from local disk on multiple
compute nodes, you probably want to use something like pbsdcp:

http://svn.nics.tennessee.edu/repos/pbstools/trunk/bin/pbsdcp
http://svn.nics.tennessee.edu/repos/pbstools/trunk/doc/man1/pbsdcp.1
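As a rough sketch (check the man page above for the exact options;
my_prog, input.dat, and the out* pattern are just placeholders), a job
script might scatter to and gather from $TMPDIR like this:

    cd $PBS_O_WORKDIR
    # scatter: copy the executable and input into $TMPDIR on every
    # node assigned to the job
    pbsdcp -s my_prog input.dat $TMPDIR
    cd $TMPDIR
    mpirun ./my_prog input.dat
    # gather: collect output files from every node's $TMPDIR back to
    # the shared working directory; the quoted wildcard keeps the
    # local shell from expanding it prematurely
    pbsdcp -g "$TMPDIR/out*" $PBS_O_WORKDIR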

Also, how is your $TMPDIR getting created?  Have you verified that it is
in fact created on all the nodes assigned to a job?
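One quick way to check is something like the following (a sketch that
assumes pbsdsh is in your path and that pbs_mom passes $TMPDIR to the
tasks it spawns):

    # run once per unique host in the job; the single quotes defer
    # expansion of $TMPDIR until the command runs on each node
    pbsdsh -u sh -c 'hostname; ls -ld "$TMPDIR"'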

Finally, if you're running an application on more than one node, you'll
need to use some sort of parallel program launcher (e.g. mpirun, mpiexec,
or charmrun) to start your program; the job script itself only executes
on the first node assigned to the job.
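For example, something along these lines (the exact flags vary between
MPI implementations, and an MPI built with Torque/TM support can usually
discover the node list on its own):

    # start one process per slot listed in the job's node file
    NP=$(wc -l < $PBS_NODEFILE)
    mpirun -np $NP -machinefile $PBS_NODEFILE ./my_prog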

	--Troy
-- 
Troy Baer, Senior HPC System Administrator
National Institute for Computational Sciences, University of Tennessee
http://www.nics.tennessee.edu/
Phone:  865-241-4233



