[torqueusers] SIGUSR1 results in a SIGTERM

Jeremy D Rogers jdrogers at northwestern.edu
Tue Mar 22 10:28:41 MDT 2011


On Tue, Mar 22, 2011 at 10:28 AM, David Beer
<dbeer at adaptivecomputing.com> wrote:
> Jeremy,
>
> This is sort of a tricky feature to get working in TORQUE.

Apparently! :-)

> First, do you have kill_delay set in qmgr?

It's set to something like 60 seconds, but the SIGTERM is being sent
immediately.

> Also, you need to make sure that the shell launched by TORQUE isn't killed by the signal that you are sending, in this case SIGUSR1. Each job is launched as the standard input to a shell, and therefore the shell must be told to catch the signal as well as the job. Often this is done by adding some signal handling code to the defaults for whichever shell you're using for your jobs.

Are you suggesting something in bashrc, or in the submit script? I saw
something about that at
http://www.open-mpi.org/faq/?category=running#qsub-notify and tried
adding the signal handling to the submit script. First I tried adding
'exec' to the mpirun command, though I'm still a little confused as to
why that was supposed to work. I guess exec is supposed to cause the
script to exit after calling mpirun, so bash doesn't catch the
SIGUSR1?
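
As far as I understand it, exec does not make the script exit first; it
replaces the shell process with the command, so the command keeps the
shell's PID and receives any signal sent to that PID directly. A minimal
sketch of that behaviour, with an inner "sh -c" standing in for mpirun
(the variable names are made up for illustration):

```shell
#!/bin/bash
# Sketch only: compare the PID of a shell with the PID seen by a command
# that the shell starts via exec. The inner "sh -c" stands in for mpirun.
pids=$(bash -c 'echo "$$"; exec sh -c "echo \$\$"')

# First line: PID of the bash process before exec.
shell_pid=$(printf '%s\n' "$pids" | sed -n 1p)
# Second line: PID reported by the command that replaced it.
exec_pid=$(printf '%s\n' "$pids" | sed -n 2p)

echo "shell PID:      $shell_pid"
echo "PID after exec: $exec_pid"
```

Both lines should show the same PID, which is why no intermediate bash
is left to be killed by the signal once exec has run.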

I also tried putting the signal handler function from the example
into my submit script, but I never see any of the echo lines in my
queue output file.
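
One possible reason, offered as a guess rather than a diagnosis: bash
only runs a trap handler after the current foreground command has
finished, so a handler in a script that is blocked on a foreground
mpirun would not fire in time. Running the job in the background and
using "wait" lets the trap run promptly. A minimal sketch of that
pattern, where worker() is a hypothetical stand-in for the real mpirun
line and the signal is simulated locally instead of coming from qsig:

```shell
#!/bin/bash
# Sketch only: trap SIGUSR1 in the submit script and forward it to the job.
outfile=$(mktemp)

# Hypothetical stand-in for the real job: saves its data on SIGUSR1.
worker() {
    trap 'echo "data saved" > "$outfile"; exit 0' USR1
    while :; do sleep 1; done
}
worker &
child=$!

# Handler in the submit script: log the signal and pass it on to the job.
forward_usr1() {
    echo "submit script caught SIGUSR1, forwarding to $child"
    kill -USR1 "$child"
}
trap forward_usr1 USR1

# Simulate "qsig -s SIGUSR1 <jobid>" arriving one second from now.
( sleep 1; kill -USR1 $$ ) &

# "wait" returns early when a trapped signal arrives, so keep waiting
# until the job itself has really exited.
while kill -0 "$child" 2>/dev/null; do
    wait "$child"
done

result=$(cat "$outfile")
echo "worker wrote: $result"
rm -f "$outfile"
```

The loop around "wait" matters: without it the script would fall off
the end as soon as the handler returned, before the job finished
writing its data.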

I suspect you are right that bash is catching the SIGUSR1 somewhere,
so I'll see if I can make some headway there.  If you have any more
suggestions, I'm all ears.
Thanks,
JDR

>
> David
>
> ----- Original Message -----
>> Hi all,
>> I've been digging through the mailing list and docs, but I'm stumped.
>> I'm trying to have my program write data and exit cleanly on receipt
>> of SIGUSR1 (or any other signal for that matter).
>>
>> The program works as expected when run with mpirun, but when using the
>> queue the job is killed with signal 15 right after receiving signal 10
>> (or 12). This is true of a small cluster of mine running torque2.4.1
>> as well as our university system running moab and I _think_
>> torque249.. still trying to figure out how to tell definitively as a
>> user.
>>
>> From the docs, I gather that the queue manager should pass along
>> signals by default, and it appears to be:
>> $ qsub -l nodes=2:ppn=2 qsubmit.sh
>> 4340.biophotonics1.bp1.loc
>> $ qsig -s SIGUSR1 4340.biophotonics1.bp1.loc
>> $ cat montecarlo.o4340
>> mpirun: Forwarding signal 10 to job
>> Caught SIGNAL 10 on proc 0, exiting..
>> Caught SIGNAL 10 on proc 0, exiting..
>> Caught SIGNAL 10 on proc 2, exiting..
>> Caught SIGNAL 10 on proc 3, exiting..
>> mpirun: killing job...
>> Caught SIGNAL 15 on proc 0, exiting..
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 31855 on node bp1n2 exited
>> on signal 0 (Unknown signal 0).
>> --------------------------------------------------------------------------
>> Caught SIGNAL 15 on proc 2, exiting..
>> Caught SIGNAL 15 on proc 1, exiting..
>> Caught SIGNAL 10 on proc 1, exiting..
>> Caught SIGNAL 15 on proc 3, exiting..
>> mpirun: clean termination accomplished
>> 4 total processes killed (some possibly by mpirun during cleanup)
>>
>> It appears that signal 10 is being forwarded properly and my program
>> catches it and begins to exit, but then the server sends a SIGTERM
>> which kills everything before my jobs can finish writing their data.
>>
>> Any suggestions on how to debug this would be appreciated.
>> Thanks,
>> JDR
>>
>> --
>> Jeremy D. Rogers, Ph.D.
>> Postdoctoral Fellow
>> Biomedical Engineering
>> Northwestern University
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>
> --
> David Beer
> Direct Line: 801-717-3386 | Fax: 801-717-3738
>     Adaptive Computing
>     1656 S. East Bay Blvd. Suite #300
>     Provo, UT 84606
>

