[torqueusers] Signalling a job with qsig -s SIGUSR2 seems to TERM as well?

Garrick Staples garrick at usc.edu
Fri Sep 1 19:12:53 MDT 2006


On Sat, Sep 02, 2006 at 11:10:28AM +1000, David Singleton alleged:
> 
> if [ -n "$PBS_ENVIRONMENT" ]; then
>    trap "" USR2
> fi
> 
> in the user's .profile?

That'll work, but I was thinking more along the lines of a generalized
solution within TORQUE.
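
For anyone trying the .profile approach in the meantime, here is a minimal
sketch of the guard plus a quick way to check it outside of TORQUE. The
`guard` variable and the child `bash -c` shells are only for illustration,
PBS_BATCH is used as a stand-in value for PBS_ENVIRONMENT, and the signal is
sent with kill from inside the child shell rather than through qsig.

```shell
#!/bin/bash
# Illustration only: "guard" holds the line that would go in .profile.
guard='if [ -n "${PBS_ENVIRONMENT:-}" ]; then trap "" USR2; fi'

# With PBS_ENVIRONMENT set (as inside a job), the shell ignores USR2
# and carries on to the echo.
in_job=$(PBS_ENVIRONMENT=PBS_BATCH bash -c "$guard"'; kill -USR2 $$; echo survived')

# With it empty (an ordinary login), USR2 keeps its default action and
# terminates the shell before the echo runs.
outside=$(PBS_ENVIRONMENT= bash -c "$guard"'; kill -USR2 $$; echo survived')

echo "in job:  ${in_job:-killed}"
echo "outside: ${outside:-killed}"
```

One thing to check before relying on this: the bash manual says that signals
ignored on entry to a shell cannot be trapped or reset, so a bash job script
started from a shell that ignores USR2 may not be able to install its own
USR2 trap.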


 
> Garrick Staples wrote:
> >On Wed, Aug 30, 2006 at 08:26:34PM -0600, Garrick Staples alleged:
> >
> >>Sounds like a bug.  I'll do some testing.
> >>
> >>On Wed, Aug 30, 2006 at 06:11:48PM +0100, Atwood, Robert C alleged:
> >>
> >>>
> >>>Hi, 
> >>>Perhaps I do not understand how to use qsig -s , but it seems pretty
> >>>straightforward from the man page and the documentation, if I send 
> >>>qsig -s SIGUSR2 
> >>>Or 
> >>>qsig -s USR2
> >>>
> >>>Or 
> >>>qsig -s 12  (on this system)
> >>>
> >>>It should pass the signal SIGUSR2 to the job? It does not mention
> >>>sending the SIGTERM signal as well, but that seems to happen on my
> >>>installation using version 2.1.2-snap.200607191251 with the default
> >>>scheduler (OS is SUSE 10 based ClusterVisionOS (CVOS) on Intel EM64T
> >>>processors).
> >>>Bash version is GNU bash, version 3.00.16(1)-release (x86_64-suse-linux)
> >>>
> >>>
> >>>
> >>>If I run the following shell script (testsig):
> >>>
> >>>
> >>>#!/bin/bash
> >>>trap 'echo "Singal USR2 received";date' USR2
> >>>trap 'echo "Singal TERM received";date' TERM
> >>>trap -p
> >>>while [[ 1 ]]
> >>>do
> >>>a=1
> >>>done
> >>>
> >>>Using >% qsub testsig -q test -l walltime=1:00:00
> >>>
> >>>I get the following in the testsig.o#### stdout file:
> >>>
> >>>
> >>>trap -- 'echo "Singal USR2 received";date' SIGUSR2
> >>>trap -- 'echo "Singal TERM received";date' SIGTERM
> >>>Singal USR2 received
> >>>Wed Aug 30 17:18:37 BST 2006
> >>>Singal TERM received
> >>>Wed Aug 30 17:18:38 BST 2006
> >>>
> >>>And the job exits. However, when  running testsig from a command line,
> >>>issuing the command 
> >>>kill -USR2 %1 
> >>>does not cause the job to exit. 
> >>>
> >>>The effect in my real job script is that signalling the job via qsig
> >>>does not allow the job to clean up its scratch files, for example,
> >>>something like this pseudoscript:
> >>>
> >>>#!/bin/bash
> >>>#PBS -l walltime=1:00:00
> >>>
> >>>(create a scratch dir on local disk)
> >>>(copy files to scratch dir)
> >>>(run the PROGRAM)
> >>>(copy files to the master node)
> >>>(delete the scratch dir)
> >>>
> >>>Despite implementing signal handlers in PROGRAM that work correctly
> >>>outside of Torque, signalling via qsig -s causes the job to terminate in
> >>>the middle of the (run the PROGRAM) step, receiving both USR2 and TERM
> >>>signals, and the following steps in the shell script are not executed.
> >>>Alternatively, I would like to send a signal to obtain intermediate results
> >>>"on demand" from a lengthy job but sending the signal terminates the
> >>>job, again despite signal handlers that work correctly outside the torque
> >>>context.
> >>>
> >>>
> >>>What am I doing wrong or misunderstanding? Or do people recommend
> >>>another way entirely to do what I want to do?
> >
> >
> >I see what is happening now.  In a job, the process that is important is
> >the top-level user shell that is the parent of the job script.  That
> >means there are always at least two processes: the user's shell and the
> >user's script.
> >
> >When you send the signal to the job, it is sent to all processes in the
> >job's process group, so both processes get the USR2.  Since the
> >top-level shell isn't trapping the signal, it exits.  The script traps
> >the signal and prints the message.
> >
> >And since the top-level shell exits, the job is exited.
> >
> >Anyone have any thoughts on solving this problem?
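
The behaviour described above can be imitated outside of TORQUE with a rough
sketch like the following. It assumes the util-linux `setsid` command is
available to give the pair of shells their own process group; the names "top"
and "script" are made up, and `kill` to a negative PID is only a stand-in for
whatever pbs_mom actually does when it delivers the signal.

```shell
#!/bin/bash
# Sketch only: "top" stands in for the job's top-level shell and the inner
# bash for the job script. setsid (util-linux) starts them in a new session,
# so they share a process group that is separate from this script's.
log=$(mktemp)
pidfile=$(mktemp)

setsid bash -c '
    echo $$ > "$1"    # session leader, so its PID is also the PGID
    bash -c "trap \"echo script: USR2 trapped\" USR2; while :; do sleep 1; done"
    echo "top: still running"    # only reached if the top shell survives
' top "$pidfile" > "$log" 2>/dev/null &

sleep 1                               # give the inner trap time to be installed
pgid=$(cat "$pidfile")
kill -USR2 -- "-$pgid"                # signal every process in the group
sleep 1
kill -KILL -- "-$pgid" 2>/dev/null    # clean up the orphaned inner loop

result=$(cat "$log")
echo "$result"
rm -f "$log" "$pidfile"
```

The log should show the inner script's trap message and nothing from the
outer shell, because the outer shell has no USR2 trap and is terminated by
the signal's default action.
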
> >
> >_______________________________________________
> >torqueusers mailing list
> >torqueusers at supercluster.org
> >http://www.supercluster.org/mailman/listinfo/torqueusers
> 
> 
> -- 
> --------------------------------------------------------------------------
>    Dr David Singleton               ANU Supercomputer Facility
>    HPC Systems Manager              and APAC National Facility
>    David.Singleton at anu.edu.au       Leonard Huxley Bldg (No. 56)
>    Phone: +61 2 6125 4389           Australian National University
>    Fax:   +61 2 6125 8199           Canberra, ACT, 0200, Australia
> --------------------------------------------------------------------------

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California