[Mauiusers] Urgent jobs

Chris Samuel csamuel at vpac.org
Mon Apr 4 19:36:33 MDT 2005

On Tue, 5 Apr 2005 11:19 am, David Jackson wrote:

>   Please issue your pleas again.  We will always respond if we have
> bandwidth available.  Please send us a specific bug description and we
> will start with that.

It wasn't me; I've copied this to the two people who emailed the list. Their
postings are in the archives at the URLs below:


The first query in the above list is something that I do have an interest in.

I was just playing around with mjobctl -s and found that although it works
well with single CPU jobs, it doesn't work with parallel (MPI) jobs.  For
parallel jobs only the processes on the mother superior are suspended;
everything else carries on running.

Looking at it, I believe that what's happening is that when PBS suspends a job
via the resume_suspend() function in resmom/requests.c it uses SIGSTOP, but
of course SIGSTOP cannot be caught by the child processes, so the program you
use to run the MPI program (mpiexec in our case) never sees it and cannot
propagate it to the other MPI processes.
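
As a rough illustration of the point about catchability (this is not code
from PBS or mpiexec), the short Python check below asks the operating system
whether a handler can be installed for a given signal. The helper name
can_install_handler is made up for this example, and it assumes a POSIX
system and that it is called from the main thread.

```python
import signal


def can_install_handler(signum):
    """Return True if this process may install its own handler for signum.

    Hypothetical helper for illustration only; call it from the main thread.
    """
    try:
        previous = signal.signal(signum, lambda s, frame: None)
    except OSError:
        # The kernel refuses handlers for signals such as SIGSTOP and SIGKILL.
        return False
    if previous is not None:
        # Put back whatever disposition was there before the probe.
        signal.signal(signum, previous)
    return True


if __name__ == "__main__":
    for name in ("SIGTSTP", "SIGSTOP"):
        signum = getattr(signal, name)
        print(name, "catchable:", can_install_handler(signum))
```

A launcher therefore gets a chance to react to SIGTSTP, but is simply frozen
by SIGSTOP before it can tell anything else to stop.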

In an ideal world the moms would know all the processes on all nodes that
needed suspending and would send SIGSTOP to all of them simultaneously, but I
don't know if that's going to happen anytime soon. :-)

In this posting to the torque-users list:


Sebastien Georget wrote that he had patched mpiexec to propagate SIGTSTP to
the processes it starts, from which I presume he'd patched the pbs_mom to use
SIGTSTP rather than SIGSTOP for suspending processes.
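
The general idea of a launcher that catches SIGTSTP and passes a stop request
on to the processes it started might look something like the Python sketch
below. This is only an assumption about the shape of such a patch, not the
actual mpiexec change: the class ForwardingLauncher and its methods are
invented for illustration, local child processes stand in for remote MPI
ranks, and the children are sent SIGSTOP (which they cannot ignore) rather
than SIGTSTP as in the patch described above.

```python
import signal
import subprocess
import sys


class ForwardingLauncher:
    """Hypothetical stand-in for a job launcher such as mpiexec."""

    def __init__(self, commands):
        # Local children stand in for the remote MPI processes.
        self.children = [subprocess.Popen(cmd) for cmd in commands]

    def suspend(self):
        # Assumption: children get SIGSTOP so they cannot ignore the request.
        for child in self.children:
            child.send_signal(signal.SIGSTOP)

    def resume(self):
        for child in self.children:
            child.send_signal(signal.SIGCONT)

    def install_handlers(self):
        # SIGTSTP can be caught, so the launcher can act before anything stops.
        signal.signal(signal.SIGTSTP, lambda signum, frame: self.suspend())
        signal.signal(signal.SIGCONT, lambda signum, frame: self.resume())

    def shutdown(self):
        for child in self.children:
            child.kill()
            child.wait()


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(5)"]
    launcher = ForwardingLauncher([sleeper, sleeper])
    launcher.install_handlers()
    for child in launcher.children:
        child.wait()
```

With something like this in place, sending SIGTSTP to the launcher suspends
every child it started, and SIGCONT resumes them. The sketch does not stop
the launcher itself, and a real launcher would also have to reach processes
on other nodes.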

I've sent a separate enquiry to the mpiexec list about whether there are any 
plans to put this functionality into mpiexec as I feel that it would be 
really useful (if pbs_mom is changed to use TSTP instead of STOP).

 Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
