[torqueusers] Signalling a job with qsig -s SIGUSR2 seems to TERM as well?

David Singleton David.Singleton at anu.edu.au
Fri Sep 1 19:10:28 MDT 2006

# Ignore USR2 in the top-level shell when running inside a batch job
if [ "x$PBS_ENVIRONMENT" != x ]; then
    trap "" USR2
fi

in the user's .profile?
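
A rough sketch of how that guard behaves, in case it helps. Here bash -c stands in for the top-level shell, and setting PBS_ENVIRONMENT by hand is only an assumption to mimic being inside a job (Torque normally sets it itself, e.g. to PBS_BATCH):

```shell
#!/bin/bash
# The guard from the .profile suggestion, kept as a string so it can be
# run inside a throwaway child shell.
profile_guard='if [ -n "${PBS_ENVIRONMENT:-}" ]; then trap "" USR2; fi'

# Child shell that looks like it is inside a job: USR2 is ignored,
# so the shell survives its own signal and reaches the echo.
inside_job=$(PBS_ENVIRONMENT=PBS_BATCH \
    bash -c "$profile_guard"'; kill -USR2 $$; echo survived')

# Child shell outside a job: the guard does nothing, and the default
# action of USR2 terminates the shell before the echo.
outside_status=0
outside_job=$(PBS_ENVIRONMENT= \
    bash -c "$profile_guard"'; kill -USR2 $$; echo survived' 2>/dev/null) \
    || outside_status=$?

echo "inside job:  '$inside_job'"
echo "outside job: '$outside_job' (exit status $outside_status)"
```

One thing to check before relying on this: an ignored signal stays ignored in child processes, and POSIX says a non-interactive shell cannot trap a signal that was already ignored when it started. So a job script that wants its own USR2 trap may not get it back, although a compiled PROGRAM can still install its own handler.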

Garrick Staples wrote:
> On Wed, Aug 30, 2006 at 08:26:34PM -0600, Garrick Staples alleged:
>>Sounds like a bug.  I'll do some testing.
>>On Wed, Aug 30, 2006 at 06:11:48PM +0100, Atwood, Robert C alleged:
>>>Perhaps I do not understand how to use qsig -s , but it seems pretty
>>>straightforward from the man page and the documentation, if I send 
>>>qsig -s SIGUSR2 
>>>qsig -s USR2
>>>qsig -s 12  (on this system)
>>>It should pass the signal SIGUSR2 to the job? It does not mention
>>>sending the SIGTERM signal as well, but that seems to happen on my
>>>installation using version: 2.1.2-snap.200607191251 with default
>>>scheduler  (OS is SUSE 10 based ClusterVisionOS (CVOS) on Intel EM64T
>>>processors )
>>>Bash version is GNU bash, version 3.00.16(1)-release (x86_64-suse-linux)
>>>If I run the following shell script (testsig):
>>>#!/bin/bash
>>>trap 'echo "Singal USR2 received";date' USR2
>>>trap 'echo "Singal TERM received";date' TERM
>>>trap -p
>>>while [[ 1 ]]
>>>do
>>>a=1
>>>done
>>>Using >% qsub testsig -q test -l walltime=1:00:00
>>>I get the following in the testsig.o#### stdout file:
>>>trap -- 'echo "Singal USR2 received";date' SIGUSR2
>>>trap -- 'echo "Singal TERM received";date' SIGTERM
>>>Singal USR2 received
>>>Wed Aug 30 17:18:37 BST 2006
>>>Singal TERM received
>>>Wed Aug 30 17:18:38 BST 2006
>>>And the job exits. However, when  running testsig from a command line,
>>>issuing the command 
>>> kill -USR2 %1 
>>>does not cause the job to exit. 
>>>The effect in my real job script is that signalling the job via qsig
>>>does not allow the job to clean up its scratch files, for example,
>>>something like this pseudoscript:
>>>#PBS -l walltime=1:00:00
>>>(create a scratch dir on local disk)
>>>(copy files to scratch dir)
>>>(run the PROGRAM)
>>>(copy files to the master node)
>>>(delete the scratch dir)
>>>Despite implementing signal handlers in PROGRAM that work correctly
>>>outside of Torque, signalling via qsig -s causes the job to terminate in
>>>the middle of the (run the PROGRAM) step, receiving both USR2 and TERM
>>>signals, and the following steps in the shell script are not executed. 
>>>Alternatively, I would like to send a signal to obtain intermediate results
>>>"on demand" from a lengthy job, but sending the signal terminates the
>>>job, again despite signal handlers that work correctly outside Torque.
>>>What am I doing wrong or misunderstanding? Or do people recommend
>>>another way entirely to do what I want to do?
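
For the clean-up part, here is a sketch of that pseudoscript with the copy-back and delete steps moved into an EXIT trap, so they run when the script body finishes or calls exit. run_program and result.dat are made-up stand-ins for PROGRAM and its output, and mktemp -d stands in for the local scratch area. This only covers the script itself; it cannot help if the shell above the script is killed, which is what Garrick describes below:

```shell
#!/bin/bash
# Hypothetical stand-in for PROGRAM: writes one result file.
run_program() { echo "intermediate result" > result.dat; }

# Stand-in for the submission directory; under Torque this would
# normally be $PBS_O_WORKDIR.
submit_dir=${PBS_O_WORKDIR:-$(mktemp -d)}
scratch=$(mktemp -d)                 # create a scratch dir on local disk

(
    # Copy results back and delete the scratch dir when this subshell exits.
    trap 'cp "$scratch"/*.dat "$submit_dir"/; rm -rf "$scratch"' EXIT
    # Report USR2 instead of letting the default action kill the script.
    trap 'echo "USR2 received"' USR2
    cd "$scratch" || exit 1
    run_program                      # run the PROGRAM
)

ls "$submit_dir"
```

The job body sits in a subshell here only so that the EXIT trap fires before the final ls; in a real job script the traps could simply go at the top of the file.
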
> I see what is happening now.  In a job, the process that is important is
> the top-level user shell that is the parent of the job script.  That
> means there are always at least two processes: the user's shell and the
> user's script.
> When you send the signal to the job, it is sent to all processes in the
> job's process group, so both processes get the USR2.  Since the
> top-level shell isn't trapping the signal, it exits.  The script traps
> the signal and prints the message.
> And since the top-level shell exits, the job exits.
> Anyone have any thoughts on solving this problem?
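
To see the difference between the two processes on their own, here is a small sketch. It signals each shell individually, so it does not reproduce the process-group delivery described above, and the bash -c children are just stand-ins for the top-level shell and the job script:

```shell
#!/bin/bash
# Stand-in for the top-level shell: no USR2 trap, so the default action
# terminates it before it can reach the echo.
untrapped_status=0
untrapped_out=$(bash -c 'kill -USR2 $$; echo survived' 2>/dev/null) \
    || untrapped_status=$?

# Stand-in for the job script: it traps USR2, runs the handler, and
# then carries on to the echo.
trapped_status=0
trapped_out=$(bash -c 'trap "echo caught" USR2; kill -USR2 $$; echo survived') \
    || trapped_status=$?

# kill -l maps an exit status above 128 back to the signal name.
echo "untrapped shell: killed by SIG$(kill -l "$untrapped_status"), output '$untrapped_out'"
echo "trapped shell:   exit status $trapped_status, output '$trapped_out'"
```

The untrapped shell never prints anything, which matches the top-level shell going away while the script's own trap still gets to print its message.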

    Dr David Singleton               ANU Supercomputer Facility
    HPC Systems Manager              and APAC National Facility
    David.Singleton at anu.edu.au       Leonard Huxley Bldg (No. 56)
    Phone: +61 2 6125 4389           Australian National University
    Fax:   +61 2 6125 8199           Canberra, ACT, 0200, Australia
