[torqueusers] Signalling a job with qsig -s SIGUSR2 seems to TERM as well?
David Singleton
David.Singleton at anu.edu.au
Fri Sep 1 19:10:28 MDT 2006
if [ "x$PBS_ENVIRONMENT" != x ]; then
    trap "" USR2
fi

in the user's .profile?
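
For anyone who wants to try this, here is a minimal sketch of the same idea with the variable test written defensively. The function name ignore_usr2_in_job is hypothetical and only there to make the fragment easy to test; the PBS_ENVIRONMENT check and the empty trap are the only parts taken from the suggestion above.

```shell
#!/bin/sh
# Sketch of a ~/.profile fragment. ignore_usr2_in_job is a hypothetical
# helper name; only the PBS_ENVIRONMENT test and the trap come from the thread.
ignore_usr2_in_job() {
    # PBS_ENVIRONMENT is set inside a job (for example to PBS_BATCH) and is
    # normally unset in an ordinary login shell.
    if [ -n "${PBS_ENVIRONMENT:-}" ]; then
        trap "" USR2    # empty action: this shell ignores SIGUSR2
    fi
}

ignore_usr2_in_job
```

One thing to check before relying on it: an ignored signal is inherited by child processes, and a non-interactive POSIX shell cannot trap a signal that was already ignored when it started. A job script that sets its own USR2 trap may therefore never see the signal, whereas a compiled PROGRAM that installs its handler with sigaction() is not affected by the inherited setting.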
Garrick Staples wrote:
> On Wed, Aug 30, 2006 at 08:26:34PM -0600, Garrick Staples alleged:
>
>>Sounds like a bug. I'll do some testing.
>>
>>On Wed, Aug 30, 2006 at 06:11:48PM +0100, Atwood, Robert C alleged:
>>
>>>
>>>Hi,
>>>Perhaps I do not understand how to use qsig -s, but it seems pretty
>>>straightforward from the man page and the documentation. If I send
>>>qsig -s SIGUSR2
>>>or
>>>qsig -s USR2
>>>
>>>or
>>>qsig -s 12 (on this system)
>>>
>>>it should pass the signal SIGUSR2 to the job? It does not mention
>>>sending the SIGTERM signal as well, but that seems to happen on my
>>>installation using version 2.1.2-snap.200607191251 with the default
>>>scheduler (OS is SUSE 10 based ClusterVisionOS (CVOS) on Intel EM64T
>>>processors).
>>>Bash version is GNU bash, version 3.00.16(1)-release (x86_64-suse-linux)
>>>
>>>
>>>
>>>If I run the following shell script (testsig):
>>>
>>>
>>>#!/bin/bash
>>>trap 'echo "Singal USR2 received";date' USR2
>>>trap 'echo "Singal TERM received";date' TERM
>>>trap -p
>>>while [[ 1 ]]
>>>do
>>>    a=1
>>>done
>>>
>>>Using
>>> % qsub testsig -q test -l walltime=1:00:00
>>>
>>>I get the following in the testsig.o#### stdout file:
>>>
>>>
>>>trap -- 'echo "Singal USR2 received";date' SIGUSR2
>>>trap -- 'echo "Singal TERM received";date' SIGTERM
>>>Singal USR2 received
>>>Wed Aug 30 17:18:37 BST 2006
>>>Singal TERM received
>>>Wed Aug 30 17:18:38 BST 2006
>>>
>>>And the job exits. However, when running testsig from a command line,
>>>issuing the command
>>> kill -USR2 %1
>>>does not cause the job to exit.
>>>
>>>The effect in my real job script is that signalling the job via qsig
>>>does not allow the job to clean up its scratch files, for example,
>>>something like this pseudoscript:
>>>
>>>#!/bin/bash
>>>#PBS -l walltime=1:00:00
>>>
>>>(create a scratch dir on local disk)
>>>(copy files to scratch dir)
>>>(run the PROGRAM)
>>>(copy files to the master node)
>>>(delete the scratch dir)
>>>
>>>Despite implementing signal handlers in PROGRAM that work correctly
>>>outside of Torque, signalling via qsig -s causes the job to terminate in
>>>the middle of the (run the PROGRAM) step, receiving both USR2 and TERM
>>>signals, and the following steps in the shell script are not executed.
>>>Alternatively, I would like to send a signal to obtain intermediate results
>>>"on demand" from a lengthy job, but sending the signal terminates the
>>>job, again despite signal handlers that work correctly outside the Torque
>>>context.
>>>
>>>
>>>What am I doing wrong or misunderstanding? Or do people recommend
>>>another way entirely to do what I want to do?
>
>
> I see what is happening now. In a job, the process that is important is
> the top-level user shell that is the parent of the job script. That
> means there are always at least two processes: the user's shell and the
> user's script.
>
> When you send the signal to the job, it is sent to all processes in the
> job's process group, so both processes get the USR2. Since the
> top-level shell isn't trapping the signal, it exits. The script traps
> the signal and prints the message.
>
> And since the top-level shell exits, the job exits.
>
> Anyone have any thoughts on solving this problem?
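
The mechanism described above can be reproduced without TORQUE. The following is only a sketch under stated assumptions: a plain `bash -c` parent stands in for the top-level user shell, a second `bash -c` stands in for the job script, and bash job control (`set -m`) is used to give the pair its own process group. None of the variable names come from TORQUE, and the final TERM merely stands in for whatever the batch system does once the top-level shell is gone.

```shell
#!/bin/bash
# Sketch: signal a whole process group, as described for qsig, without TORQUE.
# parent = stand-in for the top-level user shell (no USR2 trap)
# child  = stand-in for the job script (traps USR2)
child='trap "echo child: USR2 received" USR2; echo "ready $$"; while :; do sleep 1; done'
parent='bash -c "$CHILD" & wait'

out=$(mktemp)

wait_for() {    # poll the log for a pattern, giving up after about 5 seconds
    n=0
    until grep -q "$1" "$out" || [ "$n" -ge 50 ]; do
        sleep 0.1
        n=$((n + 1))
    done
}

set -m          # job control on: the next background job gets its own process group
CHILD=$child bash -c "$parent" >"$out" 2>&1 &
pgid=$!         # the job leader's PID is also the process group ID
set +m

wait_for '^ready'    # the child prints "ready" only after installing its trap
child_pid=$(awk '/^ready/ { print $2; exit }' "$out")

# Send USR2 to every process in the group (note the negative ID).
kill -s USR2 -- "-$pgid"

wait "$pgid"
status=$?
parent_signal=$(kill -l "$status")    # status 128+N maps back to a signal name

wait_for 'USR2 received'              # give the child time to run its handler
if kill -0 "$child_pid" 2>/dev/null; then child_survived=yes; else child_survived=no; fi

log=$(cat "$out")
kill -s TERM -- "-$pgid" 2>/dev/null  # clean up the surviving child
rm -f "$out"

echo "parent shell killed by: $parent_signal"
echo "child script survived:  $child_survived"
echo "$log"
```

If the explanation holds, the parent should be reported as killed by USR2 while the child runs its handler and keeps going until the cleanup TERM arrives, which matches the USR2-then-TERM sequence in the original report.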
>
--
--------------------------------------------------------------------------
Dr David Singleton ANU Supercomputer Facility
HPC Systems Manager and APAC National Facility
David.Singleton at anu.edu.au Leonard Huxley Bldg (No. 56)
Phone: +61 2 6125 4389 Australian National University
Fax: +61 2 6125 8199 Canberra, ACT, 0200, Australia
--------------------------------------------------------------------------