[torqueusers] SIGUSR1 results in a SIGTERM

David Beer dbeer at adaptivecomputing.com
Wed Mar 23 13:17:00 MDT 2011



----- Original Message -----
> Ok, I have some more information.. I see from a 2 year old post
> http://www.clusterresources.com/pipermail/torquedev/2009-March/001464.html
> and a follow up in January of this year by David:
> http://www.supercluster.org/pipermail/torqueusers/2011-January/011980.html
> it appears that kill_delay was broken until torque-2.4.8.
> 
> Now my cluster is torque-2.2.1, so that may be the problem, I'll try
> newer torque and see if it works. On the university cluster, we have
> torque-2.4.9, but despite the docs claiming a kill_delay of 120s,
> 'qmgr -c "list queue normal kill_delay"' returns nothing, so it may be
> unset.
> 

I'm glad you found those posts. I had thought you said you were on 2.4.something, and although I couldn't find the exact version where it changed, I knew I had fixed a bug somewhere in the 2.4 series.

Also, kill_delay is set on the server rather than on individual queues, so check the server attributes, not the queue, to verify it.
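
A rough sketch of that server-side check, assuming the standard TORQUE qmgr client is on your PATH and you are allowed to query the server. The helper name show_kill_delay is made up for this example, and the value 60 in the comment is only illustrative:

```shell
# Hypothetical helper: report whether kill_delay shows up in the
# server attributes. Assumes the TORQUE qmgr client; prints a note
# instead of failing when qmgr is not installed on this host.
show_kill_delay() {
    if ! command -v qmgr >/dev/null 2>&1; then
        echo "qmgr not found on this host"
        return 0
    fi
    # "list server" prints the server attributes, one per line.
    qmgr -c "list server" 2>/dev/null | grep kill_delay \
        || echo "kill_delay not reported by the server"
}

show_kill_delay

# A TORQUE manager could then set it (value in seconds), for example:
#   qmgr -c "set server kill_delay = 60"
```

If nothing is reported, the attribute is most likely unset, or the server could not be reached from that host.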

> David, thanks for your pointers. Just to be certain, to avoid the bash
> problem, I should be able to put the signal handler in the submit
> script, correct?
> 

That wasn't enough for me. I also had to place the signal handler in .bashrc (or the equivalent startup file for other shells); it only worked with the handler in both places. If you find differently, please let me know.
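
A minimal sketch of such a handler, assuming bash or another POSIX shell. The function name on_usr1, the usr1_seen flag, and the mpirun line in the comment are made up for illustration; the same trap lines would go in ~/.bashrc and near the top of the submit script:

```shell
# Sketch only: catch SIGUSR1 in the shell so that the default action
# (terminate) does not kill the shell that is running the job.
usr1_seen=0

on_usr1() {
    # Hypothetical handler: record the signal and keep the shell alive.
    usr1_seen=1
    echo "shell caught SIGUSR1; leaving the job to handle it"
}

trap on_usr1 USR1

# The job itself would be started after the trap is installed, e.g.
#   mpirun ./my_program    (hypothetical command)
```

Note that bash defers a trap until the foreground command it is waiting on has finished, so the echo may only show up once the job exits. The point of the trap is that the shell survives the signal. The alternative mentioned in the Open MPI FAQ linked in this thread, launching mpirun with exec, replaces the shell process with mpirun, so that shell is no longer there to receive the signal.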

Good luck,

David

> Thanks,
> JDR
> 
> --
> Jeremy D. Rogers, Ph.D.
> Postdoctoral Fellow
> Biomedical Engineering
> Northwestern University
> 
> 
> 
> On Tue, Mar 22, 2011 at 11:28 AM, Jeremy D Rogers
> <jdrogers at northwestern.edu> wrote:
> > On Tue, Mar 22, 2011 at 10:28 AM, David Beer
> > <dbeer at adaptivecomputing.com> wrote:
> >> Jeremy,
> >>
> >> This is sort of a tricky feature to get working in TORQUE.
> >
> > Apparently! :-)
> >
> >> First, do you have kill_delay set in qmgr?
> >
> > It's set to something like 60 seconds, but the SIGTERM is being sent
> > immediately.
> >
> >> Also, you need to make sure that the shell launched by TORQUE isn't
> >> killed by the signal that you are sending, in this case SIGUSR1.
> >> Each job is launched as the standard input to a shell, and
> >> therefore the shell must be told to catch the signal as well as the
> >> job. Often this is done by adding some signal handling code to the
> >> defaults for whichever shell you're using for your jobs.
> >
> > Are you suggesting something in bashrc, or in the submit script? I saw
> > something about that at
> > http://www.open-mpi.org/faq/?category=running#qsub-notify and tried
> > adding the signal handling to the submit script. I tried adding
> > 'exec' to the mpirun command, but I'm still a little confused as to
> > why that was supposed to work. I guess exec is supposed to cause the
> > script to exit after calling mpirun, so bash doesn't catch the
> > SIGUSR1?
> >
> > I also tried putting in the signal handler function in the example
> > into my submit script, but I never see any of the echo lines in my
> > queue output file.
> >
> > I suspect you are right that bash is somewhere catching the sigusr1,
> > so I'll see if I can make some headway there. If you have any more
> > suggestions, I'm all ears.
> > Thanks,
> > JDR
> >
> >>
> >> David
> >>
> >> ----- Original Message -----
> >>> Hi all,
> >>> I've been digging through the mailing list and docs, but I'm stumped.
> >>> I'm trying to have my program write data and exit cleanly on receipt
> >>> of SIGUSR1 (or any other signal for that matter).
> >>>
> >>> The program works as expected when run with mpirun, but when using the
> >>> queue the job is killed with signal 15 right after receiving signal 10
> >>> (or 12). This is true of a small cluster of mine running torque 2.4.1
> >>> as well as our university system running moab and I _think_
> >>> torque 2.4.9; I'm still trying to figure out how to tell definitively
> >>> as a user.
> >>>
> >>> From the docs, I gather that the queue manager should pass along
> >>> signals by default, and it appears to do so:
> >>> $ qsub -l nodes=2:ppn=2 qsubmit.sh
> >>> 4340.biophotonics1.bp1.loc
> >>> $ qsig -s SIGUSR1 4340.biophotonics1.bp1.loc
> >>> $ cat montecarlo.o4340
> >>> mpirun: Forwarding signal 10 to job
> >>> Caught SIGNAL 10 on proc 0, exiting..
> >>> Caught SIGNAL 10 on proc 0, exiting..
> >>> Caught SIGNAL 10 on proc 2, exiting..
> >>> Caught SIGNAL 10 on proc 3, exiting..
> >>> mpirun: killing job...
> >>> Caught SIGNAL 15 on proc 0, exiting..
> >>> --------------------------------------------------------------------------
> >>> mpirun noticed that process rank 0 with PID 31855 on node bp1n2
> >>> exited
> >>> on signal 0 (Unknown signal 0).
> >>> --------------------------------------------------------------------------
> >>> Caught SIGNAL 15 on proc 2, exiting..
> >>> Caught SIGNAL 15 on proc 1, exiting..
> >>> Caught SIGNAL 10 on proc 1, exiting..
> >>> Caught SIGNAL 15 on proc 3, exiting..
> >>> mpirun: clean termination accomplished
> >>> 4 total processes killed (some possibly by mpirun during cleanup)
> >>>
> >>> It appears that signal 10 is being forwarded properly and my program
> >>> catches it and begins to exit, but then the server sends a SIGTERM
> >>> which kills everything before my jobs can finish writing their data.
> >>>
> >>> Any suggestions on how to debug this would be appreciated.
> >>> Thanks,
> >>> JDR
> >>>
> >>> --
> >>> Jeremy D. Rogers, Ph.D.
> >>> Postdoctoral Fellow
> >>> Biomedical Engineering
> >>> Northwestern University
> >>> _______________________________________________
> >>> torqueusers mailing list
> >>> torqueusers at supercluster.org
> >>> http://www.supercluster.org/mailman/listinfo/torqueusers
> >>
> >> --
> >> David Beer
> >> Direct Line: 801-717-3386 | Fax: 801-717-3738
> >>     Adaptive Computing
> >>     1656 S. East Bay Blvd. Suite #300
> >>     Provo, UT 84606
> >>
> >

-- 
David Beer 
Direct Line: 801-717-3386 | Fax: 801-717-3738
     Adaptive Computing
     1656 S. East Bay Blvd. Suite #300
     Provo, UT 84606


