[torqueusers] SIGUSR1 results in a SIGTERM

Jeremy D Rogers jdrogers at northwestern.edu
Wed Mar 23 15:10:16 MDT 2011


[SOLVED]

On Wed, Mar 23, 2011 at 2:17 PM, David Beer <dbeer at adaptivecomputing.com> wrote:
>
>
> ----- Original Message -----
>> Ok, I have some more information.. I see from a 2 year old post
>> http://www.clusterresources.com/pipermail/torquedev/2009-March/001464.html
>> and a follow up in January of this year by David:
>> http://www.supercluster.org/pipermail/torqueusers/2011-January/011980.html
>> it appears that kill_delay was broken until torque-2.4.8.
>>
>> Now my cluster is torque-2.2.1, so that may be the problem, I'll try
>> newer torque and see if it works. On the university cluster, we have
>> torque-2.4.9, but despite the docs claiming a kill_delay of 120s,
>> "qmgr -c 'list queue normal kill_delay'" returns nothing, so it may be
>> unset.
>>
>
> I'm glad you found those posts. I had thought you said you were on 2.4.something and I couldn't find the exact version where it had changed, but I knew I had fixed a bug somewhere in the 2.4 series.

Hmph, seems to work even with torque-2.2.1 (see below).

>
> Also, kill_delay is set on the server instead of on queues, so you want to check the server to verify instead of the queue.

Indeed, you are correct. I guess the university cluster is set to 15s,
not what the docs led me to believe, but at least it's something.
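
For anyone checking their own setup, here is a rough sketch of reading
the server-level value rather than the queue-level one. It assumes that
"qmgr -c 'print server'" emits lines of the form
"set server kill_delay = N"; the helper name extract_kill_delay is made
up for illustration and is not part of TORQUE.

```shell
#!/bin/bash
# Hypothetical helper: read "qmgr -c 'print server'" style output on stdin
# and print the kill_delay value, or "unset" if no such line is present.
# Assumes lines look like: set server kill_delay = 60
extract_kill_delay() {
    awk '$1 == "set" && $2 == "server" && $3 == "kill_delay" { print $NF; found = 1 }
         END { if (!found) print "unset" }'
}

# Only query the server when qmgr is actually installed on this host.
if command -v qmgr >/dev/null 2>&1; then
    qmgr -c 'print server' | extract_kill_delay
fi
```

If this prints "unset", no explicit server-level kill_delay appeared in
that output.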

>> David, thanks for your pointers. Just to be certain, to avoid the bash
>> problem, I should be able to put the signal handler in the submit
>> script, correct?
>>
>
> This wasn't enough for me. I had to place the signal handler in the .bashrc (or equivalent file for other shells) for it to work. I had to have it in both places. If you find different please let me know.

OOOH! I can't believe it! It never occurred to me that I might need to
try both at the same time. In case anyone finds this thread in the
future, to get the signals working:

Added these trap lines to BOTH .bashrc and my submit script just
before the mpirun command:
trap "echo bash caught SIGUSR1" USR1
trap "echo bash caught SIGUSR2" USR2
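
For anyone who wants to see how such a trap behaves before trying it
under the queue, here is a minimal standalone sketch. It is not my
actual submit script: a background "sleep" stands in for mpirun, and a
delayed "kill" stands in for qsig. Running the job in the background
and calling "wait" is an assumption of this sketch (bash runs a trap
handler only after the current foreground command finishes, whereas
"wait" returns as soon as a trapped signal arrives).

```shell
#!/bin/bash
# PID of the shell that owns the traps (BASHPID is bash-only; fall back to $$).
self=${BASHPID:-$$}
caught=""

# Same idea as the trap lines above, but also record what was caught.
trap 'caught="$caught USR1"; echo "bash caught SIGUSR1"' USR1
trap 'caught="$caught USR2"; echo "bash caught SIGUSR2"' USR2

sleep 2 &                 # stand-in for: mpirun ... &
job=$!

# Stand-in for "qsig -s SIGUSR1 <jobid>": signal this shell after 1 second.
( sleep 1; kill -USR1 "$self" ) &

# The first wait is interrupted by the trapped signal (status > 128).
first=0
wait "$job" || first=$?

# Wait again so the job itself gets to finish before the script exits.
final=0
wait "$job" || final=$?

echo "first wait status: $first, final job status: $final"
```

The second "wait" matters: without it the script would exit right
after the handler runs, while the job is still writing its data.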

Still not working yet with the univ cluster, but at least I can make
headway and test against my cluster.
A huge THANK YOU David, I would have continued pulling my hair out for
more weeks without your leads.

Cheers,
JDR

>
> Good luck,
>
> David
>
>> Thanks,
>> JDR
>>
>> --
>> Jeremy D. Rogers, Ph.D.
>> Postdoctoral Fellow
>> Biomedical Engineering
>> Northwestern University
>>
>>
>>
>> On Tue, Mar 22, 2011 at 11:28 AM, Jeremy D Rogers
>> <jdrogers at northwestern.edu> wrote:
>> > On Tue, Mar 22, 2011 at 10:28 AM, David Beer
>> > <dbeer at adaptivecomputing.com> wrote:
>> >> Jeremy,
>> >>
>> >> This is sort of a tricky feature to get working in TORQUE.
>> >
>> > Apparently! :-)
>> >
>> >> First, do you have kill_delay set in qmgr?
>> >
>> > It's set to something like 60 seconds, but the SIGTERM is being sent
>> > immediately.
>> >
>> >> Also, you need to make sure that the shell launched by TORQUE isn't
>> >> killed by the signal that you are sending, in this case SIGUSR1.
>> >> Each job is launched as the standard input to a shell, and
>> >> therefore the shell must be told to catch the signal as well as the
>> >> job. Often this is done by adding some signal handling code to the
>> >> defaults for whichever shell you're using for your jobs.
>> >
>> > Are you suggesting something in bashrc, or in the submit script? I
>> > saw something about that at
>> > http://www.open-mpi.org/faq/?category=running#qsub-notify and tried
>> > both adding the signal handling to the submit script and adding
>> > 'exec' to the mpirun command. I'm still a little confused as to why
>> > that was supposed to work. I guess exec is supposed to cause the
>> > script to exit after calling mpirun, so bash doesn't catch the
>> > SIGUSR1?
>> >
>> > I also tried putting in the signal handler function in the example
>> > into my submit script, but I never see any of the echo lines in my
>> > queue output file.
>> >
>> > I suspect you are right that bash is somewhere catching the sigusr1,
>> > so I'll see if I can make some headway there. If you have any more
>> > suggestions, I'm all ears.
>> > Thanks,
>> > JDR
>> >
>> >>
>> >> David
>> >>
>> >> ----- Original Message -----
>> >>> Hi all,
>> >>> I've been digging through the mailing list and docs, but I'm stumped.
>> >>> I'm trying to have my program write data and exit cleanly on receipt
>> >>> of SIGUSR1 (or any other signal for that matter).
>> >>>
>> >>> The program works as expected when run with mpirun, but when using the
>> >>> queue the job is killed with signal 15 right after receiving signal 10
>> >>> (or 12). This is true of a small cluster of mine running torque2.4.1
>> >>> as well as our university system running moab and I _think_
>> >>> torque249.. still trying to figure out how to tell definitively as a
>> >>> user.
>> >>>
>> >>> From the docs, I gather that the queue manager should pass along
>> >>> signals by default, and it appears to be:
>> >>> $ qsub -l nodes=2:ppn=2 qsubmit.sh
>> >>> 4340.biophotonics1.bp1.loc
>> >>> $ qsig -s SIGUSR1 4340.biophotonics1.bp1.loc
>> >>> $ cat montecarlo.o4340
>> >>> mpirun: Forwarding signal 10 to job
>> >>> Caught SIGNAL 10 on proc 0, exiting..
>> >>> Caught SIGNAL 10 on proc 0, exiting..
>> >>> Caught SIGNAL 10 on proc 2, exiting..
>> >>> Caught SIGNAL 10 on proc 3, exiting..
>> >>> mpirun: killing job...
>> >>> Caught SIGNAL 15 on proc 0, exiting..
>> >>> --------------------------------------------------------------------------
>> >>> mpirun noticed that process rank 0 with PID 31855 on node bp1n2 exited
>> >>> on signal 0 (Unknown signal 0).
>> >>> --------------------------------------------------------------------------
>> >>> Caught SIGNAL 15 on proc 2, exiting..
>> >>> Caught SIGNAL 15 on proc 1, exiting..
>> >>> Caught SIGNAL 10 on proc 1, exiting..
>> >>> Caught SIGNAL 15 on proc 3, exiting..
>> >>> mpirun: clean termination accomplished
>> >>> 4 total processes killed (some possibly by mpirun during cleanup)
>> >>>
>> >>> It appears that signal 10 is being forwarded properly and my program
>> >>> catches it and begins to exit, but then the server sends a SIGTERM
>> >>> which kills everything before my jobs can finish writing their data.
>> >>>
>> >>> Any suggestions on how to debug this would be appreciated.
>> >>> Thanks,
>> >>> JDR
>> >>>
>> >>> _______________________________________________
>> >>> torqueusers mailing list
>> >>> torqueusers at supercluster.org
>> >>> http://www.supercluster.org/mailman/listinfo/torqueusers
>> >>
>> >> --
>> >> David Beer
>> >> Direct Line: 801-717-3386 | Fax: 801-717-3738
>> >>     Adaptive Computing
>> >>     1656 S. East Bay Blvd. Suite #300
>> >>     Provo, UT 84606
>> >>
>> >>
>> >
>
>

