[torqueusers] Signalling a job with qsig -s SIGUSR2 seems to TERM as well?

Atwood, Robert C r.atwood at imperial.ac.uk
Mon Sep 4 11:36:30 MDT 2006


> Garrick Staples says
>
> 
> On Sat, Sep 02, 2006 at 11:10:28AM +1000, David Singleton alleged:
> > 
> > if [ x$PBS_ENVIRONMENT != x ]; then
> >    trap "" USR2
> > fi
> > 
> > in the users .profile?
> 
> That'll work, but I was thinking more along the lines of a generalized
> solution within TORQUE.
> 
Yes it works and may prove useful, for now...

I don't understand, though, why putting trap "" USR2 inside the script
that is submitted to qsub does not work.
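
For reference, here is a minimal sketch of how a single bash process
behaves outside TORQUE, assuming bash on Linux. The helper name
send_usr2_to_self is made up for illustration. It shows only that an
untrapped USR2 terminates a shell while an ignored one does not; it says
nothing about how many shells TORQUE starts for a job.

```shell
#!/bin/bash
# Sketch only: send USR2 to a fresh bash process, with and without an
# ignore trap. send_usr2_to_self is a hypothetical helper name.
send_usr2_to_self() {
    local setup=$1    # shell code run before the signal, e.g. 'trap "" USR2'
    bash -c "$setup"'; kill -USR2 $$; echo "still running"'
    local status=$?
    # An exit status above 128 means the child was killed by a signal.
    if [ "$status" -gt 128 ]; then
        echo "killed by SIG$(kill -l "$status")"
    fi
}

send_usr2_to_self ':'               # no trap: the default action ends the shell
send_usr2_to_self 'trap "" USR2'    # USR2 ignored: the shell carries on
```

The first call should report that the child shell was killed by the
signal, and the second should reach its echo.
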

However, users may be using more than one software package in the queue,
which may have different requirements for signal passing. At present,
though, I am not aware of any that my users are actually using except
the ones I wrote, so perhaps I could safely add this to everyone's
environment. 

I would probably regret it later, though, when some users need to have it
and some users need to *not* have it, or even the same users need to have
it on or off for different packages.


To clarify, here's how I've been using torque, and openpbs before that,
and gnqs before that, and nqs before that ... I provide a (package)_q
script for each commonly used package and install it in the common
program file area. The students then run this with the minimum of
arguments; several arguments that are specific to the package are set up
in the (package)_q script. This script calls qsub to submit another
script, (package)_run, usually with some -v arguments to pass environment
variables. Most users never need to change anything except the requested
time and the name of the input file for the job, so they can get on with
deciding what inputs to use and how to interpret the outputs instead of
figuring out the use of qsub and torque/pbs/gnqs/whatever and how to
trap signals in bash.

If the trap needs to be set in the .profile, then the unfortunate user
who needs to use my package and also some other package that handles a
different set of signals would have to switch profiles constantly.
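
To make the arrangement concrete, here is a rough sketch of what such a
(package)_q wrapper might look like. All the names in it (mypackage_run,
INPUT_FILE, the install path, the default walltime) are invented for
illustration and are not the real scripts. Only the qsub options -l and
-v are taken from actual use.

```shell
#!/bin/bash
# Hypothetical mypackage_q wrapper. The run script path, the INPUT_FILE
# variable name and the default walltime are illustrative assumptions.
RUN_SCRIPT=${RUN_SCRIPT:-/usr/local/bin/mypackage_run}

build_qsub_command() {
    local input_file=$1
    local walltime=${2:-1:00:00}
    # -l requests resources; -v passes environment variables to the job.
    printf '%s\n' "qsub -l walltime=$walltime -v INPUT_FILE=$input_file $RUN_SCRIPT"
}

# Print the command instead of running it, so it can be inspected first.
build_qsub_command example.inp 2:00:00
```

A real wrapper would run the command rather than print it; printing it
keeps the sketch usable on a machine without qsub installed.
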

Perhaps there's a better way for me to organize the whole process
(I'll listen to suggestions, but please don't recommend demanding that
each student learn all the details of the queueing system), but a
solution that can be triggered from inside either the (package)_q or
(package)_run script would be ideal. I gather from the comment above
that Garrick already agrees with this or similar logic for a solution
within Torque. I'm not sure how to helpfully suggest a way to do it,
though; either a flag to qsub or a #directive would be fine ... What do
other queue systems do? I've used the ones mentioned above, but I had not
tried writing my own program with signal handlers at that time.
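
As a sketch of the (package)_run side, assuming bash, the scratch
handling from the pseudoscript quoted below could be written roughly as
follows. The function name run_in_scratch and its arguments are
hypothetical. Note that the trap here only keeps the job script itself
alive when USR2 arrives; going by Garrick's description quoted below, it
does nothing for the top-level shell that TORQUE starts as the script's
parent.

```shell
#!/bin/bash
# Hypothetical (package)_run body. run_in_scratch and its arguments are
# illustrative; this is a sketch, not the real run script.
run_in_scratch() {
    local results_dir=$1
    shift
    local scratch status
    scratch=$(mktemp -d) || return 1    # create a scratch dir on local disk
    # A no-op handler replaces the default action (terminate), so this
    # script survives a USR2 while the program is running.
    trap ':' USR2
    ( cd "$scratch" && "$@" )           # run the PROGRAM inside the scratch dir
    status=$?
    cp -r "$scratch"/. "$results_dir"/  # copy files back
    rm -rf "$scratch"                   # delete the scratch dir
    trap - USR2
    return "$status"
}

results_dir=$(mktemp -d)
run_in_scratch "$results_dir" sh -c 'echo finished > output.txt'
cat "$results_dir/output.txt"
```

The program still receives USR2 itself if the signal goes to the whole
process group, so its own handlers remain in charge of what it does with
the signal.
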

Thanks 
Robert





-----Original Message-----
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of David
Singleton
Sent: 02 September 2006 02:10
To: torqueusers at supercluster.org
Subject: Re: [torqueusers] Signalling a job with qsig -s SIGUSR2 seems
to TERM as well?


if [ x$PBS_ENVIRONMENT != x ]; then
    trap "" USR2
fi

in the users .profile?

Garrick Staples wrote:
> On Wed, Aug 30, 2006 at 08:26:34PM -0600, Garrick Staples alleged:
> 
>>Sounds like a bug.  I'll do some testing.
>>
>>On Wed, Aug 30, 2006 at 06:11:48PM +0100, Atwood, Robert C alleged:
>>
>>> 
>>>Hi, 
>>>Perhaps I do not understand how to use qsig -s , but it seems pretty
>>>straightforward from the man page and the documentation, if I send 
>>>qsig -s SIGUSR2 
>>>Or 
>>>qsig -s USR2
>>>
>>>Or 
>>>qsig -s 12  (on this system)
>>>
>>>It should pass the signal SIGUSR2 to the job? It does not mention also
>>>sending the SIGTERM signal as well, but that seems to happen on my
>>>installation using version: 2.1.2-snap.200607191251 with default
>>>scheduler  (OS is SUSE 10 based ClusterVisionOS (CVOS) on Intel EM64T
>>>processors )
>>>Bash version is GNU bash, version 3.00.16(1)-release (x86_64-suse-linux)
>>>
>>>
>>>
>>>If I run the following shell script (testsig):
>>>
>>>
>>>#!/bin/bash
>>>trap 'echo "Singal USR2 received";date' USR2
>>>trap 'echo "Singal TERM received";date' TERM
>>>trap -p
>>>while [[ 1 ]]
>>>do
>>>a=1
>>>done
>>>
>>>Using >% qsub testsig -q test -l walltime=1:00:00
>>>
>>>I get the following in the testsig.o#### stdout file:
>>>
>>>
>>>trap -- 'echo "Singal USR2 received";date' SIGUSR2
>>>trap -- 'echo "Singal TERM received";date' SIGTERM
>>>Singal USR2 received
>>>Wed Aug 30 17:18:37 BST 2006
>>>Singal TERM received
>>>Wed Aug 30 17:18:38 BST 2006
>>>
>>>And the job exits. However, when running testsig from a command line,
>>>issuing the command 
>>> kill -USR2 %1 
>>>does not cause the job to exit. 
>>>
>>>The effect in my real job script is that signalling the job via qsig
>>>does not allow the job to clean up its scratch files, for example,
>>>something like this pseudoscript:
>>>
>>>#!/bin/bash
>>>#PBS -l walltime=1:00:00
>>>
>>>(create a scratch dir on local disk)
>>>(copy files to scratch dir)
>>>(run the PROGRAM)
>>>(copy files to the master node)
>>>(delete the scratch dir)
>>>
>>>Despite implementing signal handlers in PROGRAM that work correctly
>>>outside of Torque, signalling via qsig -s causes the job to terminate
>>>in the middle of the (run the PROGRAM) step, receiving both USR2 and
>>>TERM signals, and the following steps in the shell script are not
>>>executed.
>>>Alternatively, I would like to send a signal to obtain intermediate
>>>results "on demand" from a lengthy job but sending the signal
>>>terminates the job, again despite signal handlers that work correctly
>>>outside the torque context.
>>>
>>>
>>>What am I doing wrong or misunderstanding? Or do people recommend
>>>another way entirely to do what I want to do?
> 
> 
> I see what is happening now.  In a job, the process that is important
> is the top-level user shell that is the parent of the job script.  That
> means there are always at least 2 processes: the user's shell, and the
> user's script.
> 
> When you send the signal to the job, it is sent to all processes in
> the job's process group, so both processes get the USR2.  Since the
> top-level shell isn't trapping the signal, it exits.  The script traps
> the signal and prints the message.
> 
> And since the top-level shell exits, the job is exited.
> 
> Anyone have any thoughts on solving this problem?
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


-- 
--------------------------------------------------------------------------
    Dr David Singleton               ANU Supercomputer Facility
    HPC Systems Manager              and APAC National Facility
    David.Singleton at anu.edu.au       Leonard Huxley Bldg (No. 56)
    Phone: +61 2 6125 4389           Australian National University
    Fax:   +61 2 6125 8199           Canberra, ACT, 0200, Australia
--------------------------------------------------------------------------