[torqueusers] How to suspend a job ?

Sébastien Georget Sebastien.Georget@sophia.inria.fr
Mon, 19 Apr 2004 14:48:41 +0200

Sébastien Georget wrote:
> Hi,
>   I would like to know if there is an easy way to suspend an MPI job using 
> torque.
> I see two problems :
> 1/ send a signal to the 'master'
> 2/ send a signal to the 'slaves'
> 1/ I tried to use 'qsig -s SIGTSTP myjob'. The signal is sent to the 
> mom, then forwarded to the shell used to start the pbs script.
> The signal stops there; it does not seem to be sent to the children of 
> the pbs script. Is this the normal behaviour ? If so, how can an 
> application be suspended ?
It seems that there was a problem during my first tests. The signal is 
correctly sent to all children.

> 2/ Will 'mpiexec' send the signal to all the hosts involved in the MPI 
> run ? Are there solutions to suspend all MPI processes and not only the 
> master ?
By default mpiexec doesn't catch the SIGTSTP signal. I wrote a small 
patch that makes it catch SIGTSTP and send a SIGSTOP to each MPI 
process.
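
As a rough illustration of the idea (this is not the mpiexec patch 
itself), here is a minimal Python sketch of a launcher that catches 
SIGTSTP and forwards SIGSTOP to the processes it started. The names 
signal_all, on_tstp, on_cont and run are hypothetical, and the sketch 
only handles processes on the local host; a real mpiexec would also have 
to reach the processes running on the other nodes.

```python
import os
import signal
import subprocess

# Hypothetical bookkeeping: the worker processes started by this launcher.
children = []


def signal_all(procs, sig):
    """Send sig to every process still alive; return the pids signalled."""
    signalled = []
    for proc in procs:
        # poll() returns None while the child has not exited
        # (a stopped child still counts as not exited).
        if proc.poll() is None:
            os.kill(proc.pid, sig)
            signalled.append(proc.pid)
    return signalled


def on_tstp(signum, frame):
    # SIGTSTP can be caught by the launcher; SIGSTOP cannot be caught or
    # ignored, so the workers are stopped whatever handlers they install.
    signal_all(children, signal.SIGSTOP)


def on_cont(signum, frame):
    # Resume the workers when the launcher itself is continued.
    signal_all(children, signal.SIGCONT)


def run(argv, count=2):
    """Start count copies of argv, forward suspend/resume, wait for them."""
    signal.signal(signal.SIGTSTP, on_tstp)
    signal.signal(signal.SIGCONT, on_cont)
    children.extend(subprocess.Popen(argv) for _ in range(count))
    return [proc.wait() for proc in children]
```

For example, run(["sleep", "60"]) starts two workers; sending SIGTSTP to 
the launcher then stops both of them, and SIGCONT resumes them. The 
launcher keeps running because it handles SIGTSTP instead of taking the 
default action.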

Users can now suspend their jobs manually, but the jobs are still marked 
as running in the qstat output.
It seems that a job is marked as suspended when it is signaled with the 
'suspend' signal (qsig -s suspend jobid), but suspend = SIGSTOP.
Is it possible to change 'suspend' to SIGTSTP when the user is running a 
parallel job and leave it as SIGSTOP for sequential jobs ?

Sébastien Georget
INRIA Sophia-Antipolis, Service DREAM, B.P. 93
06902 Sophia-Antipolis Cedex, FRANCE
E-mail : sebastien.georget@sophia.inria.fr
