[torqueusers] Signalling a job with qsig -s SIGUSR2 seems to TERM as well?

Garrick Staples garrick at clusterresources.com
Fri Sep 1 17:42:37 MDT 2006


On Wed, Aug 30, 2006 at 08:26:34PM -0600, Garrick Staples alleged:
> Sounds like a bug.  I'll do some testing.
> 
> On Wed, Aug 30, 2006 at 06:11:48PM +0100, Atwood, Robert C alleged:
> >  
> > Hi, 
> > Perhaps I do not understand how to use qsig -s, but it seems pretty
> > straightforward from the man page and the documentation. If I send
> > qsig -s SIGUSR2
> > or
> > qsig -s USR2
> > 
> > or
> > qsig -s 12  (on this system)
> > 
> > it should pass the signal SIGUSR2 to the job? Neither mentions that
> > SIGTERM is sent as well, but that seems to happen on my
> > installation, which runs version 2.1.2-snap.200607191251 with the default
> > scheduler (OS is SUSE 10 based ClusterVisionOS (CVOS) on Intel EM64T
> > processors).
> > The Bash version is GNU bash, version 3.00.16(1)-release (x86_64-suse-linux).
> > 
> > 
> > 
> > If I run the following shell script (testsig):
> > 
> > 
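> > #!/bin/bash
> > # Report each signal, with a timestamp, when it arrives
> > trap 'echo "Singal USR2 received";date' USR2
> > trap 'echo "Singal TERM received";date' TERM
> > # List the traps that are installed
> > trap -p
> > # Busy loop that keeps the job alive until it is signalled
> > while true
> > do
> >     a=1
> > done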
> > 
> > Submitting it with
> >   % qsub testsig -q test -l walltime=1:00:00
> > 
> > I get the following in the testsig.o#### stdout file:
> > 
> > 
> > trap -- 'echo "Singal USR2 received";date' SIGUSR2
> > trap -- 'echo "Singal TERM received";date' SIGTERM
> > Singal USR2 received
> > Wed Aug 30 17:18:37 BST 2006
> > Singal TERM received
> > Wed Aug 30 17:18:38 BST 2006
> > 
> > And the job exits. However, when running testsig from a command line,
> > issuing the command
> >  kill -USR2 %1
> > does not cause the job to exit.
> > 
> > The effect in my real job script is that signalling the job via qsig
> > does not allow the job to clean up its scratch files. For example,
> > consider something like this pseudoscript:
> > 
> > #!/bin/bash
> > #PBS -l walltime=1:00:00
> > 
> > (create a scratch dir on local disk)
> > (copy files to scratch dir)
> > (run the PROGRAM)
> > (copy files to the master node)
> > (delete the scratch dir)
> > 
> > Despite implementing signal handlers in PROGRAM that work correctly
> > outside of Torque, signalling via qsig -s causes the job to terminate in
> > the middle of the (run the PROGRAM) step, receiving both USR2 and TERM
> > signals, and the following steps in the shell script are not executed.
> > Alternatively, I would like to send a signal to obtain intermediate results
> > "on demand" from a lengthy job, but sending the signal terminates the
> > job, again despite signal handlers that work correctly outside the Torque
> > context.
> > 
> > 
> > What am I doing wrong or misunderstanding? Or do people recommend
> > another way entirely to do what I want to do?

I see what is happening now.  In a job, the process that is important is
the top-level user shell that is the parent of the job script.  That
means there are always at least two processes: the user's shell, and the
user's script.

When you send the signal to the job, it is sent to all processes in the
job's process group, so both processes get the USR2.  Since the
top-level shell isn't trapping the signal, it exits.  The script traps
the signal and prints the message.

And since the top-level shell exits, the job exits.
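
The effect can probably be reproduced outside Torque.  What follows is a
minimal sketch, not a description of what pbs_mom actually does: it assumes
Linux and bash, and it assumes the signal really is delivered to a whole
process group as described above.  The names inner.sh, inner.pid and
inner.log are made up for the illustration.  An "outer" shell with no trap
stands in for the top-level user shell, and an "inner" script with a USR2
trap stands in for the job script.

```shell
#!/bin/bash
# Sketch only: file names and variable names here are hypothetical.
work=$(mktemp -d)
export PIDFILE="$work/inner.pid" LOG="$work/inner.log"

# Stand-in for the job script: traps USR2, like testsig.
cat > "$work/inner.sh" <<'EOF'
#!/bin/bash
trap 'echo "inner: USR2 trapped" >> "$LOG"' USR2
echo $$ > "$PIDFILE"
while true; do sleep 1; done
EOF

# Stand-in for the top-level shell: no trap, parent of the script.
set -m                       # job control: the background job gets its own process group
bash -c 'bash "$0"; echo "outer: script returned"' "$work/inner.sh" &
outer_pid=$!                 # also the process group ID, because of set -m
set +m

# Wait (up to about 5 s) until the inner script has installed its trap.
for _ in $(seq 50); do [ -s "$PIDFILE" ] && break; sleep 0.1; done
inner_pid=$(cat "$PIDFILE")

kill -USR2 -- "-$outer_pid"  # negative PID: signal every process in the group
wait "$outer_pid"            # returns once the untrapped outer shell is gone
outer_status=$?

# Give the inner script a moment to run its trap, then check on it.
for _ in $(seq 50); do [ -s "$LOG" ] && break; sleep 0.1; done
if kill -0 "$inner_pid" 2>/dev/null; then inner_alive=yes; else inner_alive=no; fi

echo "outer exit status: $outer_status (above 128 means killed by a signal)"
echo "inner still running: $inner_alive"
cat "$LOG"

kill -KILL -- "-$outer_pid" 2>/dev/null   # clean up the surviving inner script
rm -rf "$work"
```

If the assumption holds, the outer shell should be reported as killed by the
signal while the inner script is still running with its trap executed, which
mirrors the two processes in the job.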

Anyone have any thoughts on solving this problem?


