[torqueusers] Re: [Mauiusers] Urgent jobs
csamuel at vpac.org
Mon Apr 4 19:36:33 MDT 2005
On Tue, 5 Apr 2005 11:19 am, David Jackson wrote:
> Please issue your pleas again. We will always respond if we have
> bandwidth available. Please send us a specific bug description and we
> will start with that.
It wasn't me; I've copied this to the two people who emailed the list. Their
postings are in the archives at the URLs below:
The first query in the above list is something that I do have an interest in.
I was just playing around with mjobctl -s and found that although it works
well with single CPU jobs it doesn't work with parallel (MPI) jobs. For
parallel jobs only the processes on the mother superior are suspended;
everything else carries on running.
Looking at it I believe that what's happening is that when PBS suspends a job
via the resume_suspend() function in resmom/requests.c it uses SIGSTOP, but
of course SIGSTOP cannot be caught by the processes that receive it, so the
program you use to run the MPI program (mpiexec in our case) never sees it and
cannot propagate it to the other MPI processes.
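To illustrate the difference between the two signals, here is a minimal
Python sketch (not taken from TORQUE or mpiexec; the helper name
can_install_handler is made up for this example). It simply asks the OS
whether a handler may be installed for a given signal. On a POSIX system I
would expect it to report that SIGTSTP can be caught and SIGSTOP cannot.

```python
import signal


def can_install_handler(signum):
    """Return True if the OS lets this process install a handler for signum.

    Hypothetical helper for illustration only. Must be called from the
    main thread, because Python only allows signal.signal() there.
    """
    try:
        previous = signal.signal(signum, lambda s, frame: None)
    except (OSError, ValueError):
        # The kernel refuses handlers for SIGSTOP and SIGKILL.
        return False
    if previous is not None:
        # Put the original disposition back so the check has no side effects.
        signal.signal(signum, previous)
    return True


if __name__ == "__main__":
    for name in ("SIGTSTP", "SIGSTOP"):
        signum = getattr(signal, name)
        print(name, "catchable:", can_install_handler(signum))
```

This is why a launcher can react to SIGTSTP but has no chance to react to
SIGSTOP: it is stopped before any of its own code runs.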
In an ideal world the moms would know all the processes on all nodes that
needed suspending and would send SIGSTOP to all of them simultaneously, but I
don't know if that's going to happen anytime soon. :-)
In this posting to the torque-users list:
Sebastien Georget wrote that he had patched mpiexec to propagate SIGTSTP to
the processes it starts, from which I presume he'd patched the pbs_mom to use
SIGTSTP rather than SIGSTOP for suspending processes.
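As a rough idea of what such propagation could look like, here is a minimal
Python sketch of a launcher that relays a suspend request to its children.
It is not Sebastien's patch and not mpiexec code: the function names are
invented, the children are local subprocess.Popen objects standing in for
remote MPI processes, and it assumes the job is later resumed with SIGCONT.
Note also that this sketch sends SIGSTOP to the children (so they need not
cooperate), whereas the patch described above forwards SIGTSTP.

```python
import os
import signal
import subprocess
import sys


def forward(children, signum):
    """Send signum to every child (a subprocess.Popen) that has not exited."""
    for child in children:
        if child.poll() is None:
            os.kill(child.pid, signum)


def install_suspend_forwarding(children):
    """Hypothetical launcher hook: relay SIGTSTP and SIGCONT to children.

    Call from the main thread of the launcher process.
    """
    def on_tstp(signum, frame):
        forward(children, signal.SIGSTOP)     # stop the children first
        os.kill(os.getpid(), signal.SIGSTOP)  # then stop the launcher itself

    def on_cont(signum, frame):
        forward(children, signal.SIGCONT)     # wake the children on resume

    signal.signal(signal.SIGTSTP, on_tstp)
    signal.signal(signal.SIGCONT, on_cont)


if __name__ == "__main__":
    # Stand-in "MPI processes": two local sleepers.
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    kids = [subprocess.Popen(sleeper) for _ in range(2)]
    install_suspend_forwarding(kids)
    for kid in kids:
        kid.wait()
```

A real launcher would have to relay the request over whatever connection it
holds to the other nodes rather than calling os.kill() locally, but the
catch-then-forward structure would be the same.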
I've sent a separate enquiry to the mpiexec list about whether there are any
plans to put this functionality into mpiexec as I feel that it would be
really useful (if pbs_mom is changed to use TSTP instead of STOP).
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia