[torqueusers] SIGUSR1 results in a SIGTERM

Jeremy D Rogers jdrogers at northwestern.edu
Wed Mar 23 12:25:53 MDT 2011


Ok, I have some more information. From a two-year-old post
http://www.clusterresources.com/pipermail/torquedev/2009-March/001464.html
and a follow-up in January of this year by David:
http://www.supercluster.org/pipermail/torqueusers/2011-January/011980.html
it appears that kill_delay was broken until torque-2.4.8.

Now my cluster is torque-2.2.1, so that may be the problem. I'll try a
newer torque and see if it works. On the university cluster, we have
torque-2.4.9, but despite the docs claiming a kill_delay of 120s,
qmgr -c "list queue normal kill_delay" returns nothing, so it may be
unset.
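
For anyone checking the same thing, here is a sketch of the queries
involved. The first qmgr command is the one quoted above. The
server-level query and the commented-out set command are my reading of
the qmgr syntax and are not verified against 2.4.9; setting an
attribute would also need manager privileges. The queue name "normal"
is specific to our cluster, and the guard simply skips the queries on
machines without qmgr.

```shell
#!/bin/sh
# Sketch: inspect kill_delay at the queue level and at the server level.
if command -v qmgr >/dev/null 2>&1; then
    # Queue-level attribute (command taken from the message above).
    qmgr -c "list queue normal kill_delay"
    # Server-level attribute (assumed to exist alongside the queue one).
    qmgr -c "list server kill_delay"
    # To set it (assumption; manager privileges required, value illustrative):
    # qmgr -c "set queue normal kill_delay = 120"
    checked=yes
else
    echo "qmgr not found; run this on a TORQUE host"
    checked=skipped
fi
```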

David, thanks for your pointers. Just to be certain, to avoid the bash
problem, I should be able to put the signal handler in the submit
script, correct?
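
To make the question concrete, here is a minimal sketch of the kind of
submit script I have in mind. It has not been verified under TORQUE:
"worker" is a hypothetical stand-in for the real mpirun line, and the
subshell that signals the script after one second only simulates what
qsig would do. The idea is to trap SIGUSR1 in the script's own shell
(so the default action does not terminate it), forward the signal to
the job, and keep waiting until the job has finished writing its data.

```shell
#!/bin/bash
# Sketch only: worker() is a hypothetical stand-in for the mpirun command.

worker() {
    # Stand-in program: on SIGUSR1, "write data" and exit cleanly.
    trap 'echo "worker: caught SIGUSR1, writing data"; exit 0' USR1
    while true; do sleep 1; done
}

forwarded=0
forward_usr1() {
    echo "submit script: caught SIGUSR1, forwarding to worker"
    forwarded=1
    kill -USR1 "$worker_pid"
}

# Without a trap, the default action for SIGUSR1 terminates this shell.
trap forward_usr1 USR1

# Run the job in the background so the shell can react to signals.
worker &
worker_pid=$!

# Demonstration only: signal this script after 1 second, as qsig would.
( sleep 1; kill -USR1 $$ ) &

# wait returns early (status > 128) when a trapped signal arrives,
# so wait a second time to collect the worker's real exit status.
status=0
wait "$worker_pid" || status=$?
if [ "$status" -gt 128 ]; then
    status=0
    wait "$worker_pid" || status=$?
fi
echo "submit script: worker exited with status $status"
```

The job is started in the background because bash defers a trap until
the current foreground command finishes, whereas the wait builtin
returns as soon as a trapped signal arrives.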

Thanks,
JDR

--
Jeremy D. Rogers, Ph.D.
Postdoctoral Fellow
Biomedical Engineering
Northwestern University



On Tue, Mar 22, 2011 at 11:28 AM, Jeremy D Rogers
<jdrogers at northwestern.edu> wrote:
> On Tue, Mar 22, 2011 at 10:28 AM, David Beer
> <dbeer at adaptivecomputing.com> wrote:
>> Jeremy,
>>
>> This is sort of a tricky feature to get working in TORQUE.
>
> Apparently! :-)
>
>> First, do you have kill_delay set in qmgr?
>
> It's set to something like 60 seconds, but the SIGTERM is being sent
> immediately.
>
>> Also, you need to make sure that the shell launched by TORQUE isn't killed by the signal that you are sending, in this case SIGUSR1. Each job is launched as the standard input to a shell, and therefore the shell must be told to catch the signal as well as the job. Often this is done by adding some signal handling code to the defaults for whichever shell you're using for your jobs.
>
> Are you suggesting something in bashrc, or in the submit script? I saw
> something about that at
> http://www.open-mpi.org/faq/?category=running#qsub-notify and tried
> adding the signal handling to the submit script. I tried adding
> 'exec' to the mpirun command, but I'm still a little confused as to
> why that was supposed to work. I guess exec is supposed to cause the
> script to exit after calling mpirun, so bash doesn't catch the
> SIGUSR1?
>
> I also tried putting in the signal handler function in the example
> into my submit script, but I never see any of the echo lines in my
> queue output file.
>
> I suspect you are right that bash is catching the SIGUSR1 somewhere,
> so I'll see if I can make some headway there.  If you have any more
> suggestions, I'm all ears.
> Thanks,
> JDR
>
>>
>> David
>>
>> ----- Original Message -----
>>> Hi all,
>>> I've been digging through the mailing list and docs, but I'm stumped.
>>> I'm trying to have my program write data and exit cleanly on receipt
>>> of SIGUSR1 (or any other signal for that matter).
>>>
>>> The program works as expected when run with mpirun, but when using the
>>> queue the job is killed with signal 15 right after receiving signal 10
>>> (or 12). This is true of a small cluster of mine running torque2.4.1
>>> as well as our university system running moab and I _think_
>>> torque-2.4.9. I'm still trying to figure out how to tell definitively
>>> as a user.
>>>
>>> From the docs, I gather that the queue manager should pass along
>>> signals by default, and it appears to be:
>>> $ qsub -l nodes=2:ppn=2 qsubmit.sh
>>> 4340.biophotonics1.bp1.loc
>>> $ qsig -s SIGUSR1 4340.biophotonics1.bp1.loc
>>> $ cat montecarlo.o4340
>>> mpirun: Forwarding signal 10 to job
>>> Caught SIGNAL 10 on proc 0, exiting..
>>> Caught SIGNAL 10 on proc 0, exiting..
>>> Caught SIGNAL 10 on proc 2, exiting..
>>> Caught SIGNAL 10 on proc 3, exiting..
>>> mpirun: killing job...
>>> Caught SIGNAL 15 on proc 0, exiting..
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 31855 on node bp1n2 exited
>>> on signal 0 (Unknown signal 0).
>>> --------------------------------------------------------------------------
>>> Caught SIGNAL 15 on proc 2, exiting..
>>> Caught SIGNAL 15 on proc 1, exiting..
>>> Caught SIGNAL 10 on proc 1, exiting..
>>> Caught SIGNAL 15 on proc 3, exiting..
>>> mpirun: clean termination accomplished
>>> 4 total processes killed (some possibly by mpirun during cleanup)
>>>
>>> It appears that signal 10 is being forwarded properly and my program
>>> catches it and begins to exit, but then the server sends a SIGTERM
>>> which kills everything before my jobs can finish writing their data.
>>>
>>> Any suggestions on how to debug this would be appreciated.
>>> Thanks,
>>> JDR
>>>
>>> --
>>> Jeremy D. Rogers, Ph.D.
>>> Postdoctoral Fellow
>>> Biomedical Engineering
>>> Northwestern University
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>> --
>> David Beer
>> Direct Line: 801-717-3386 | Fax: 801-717-3738
>>     Adaptive Computing
>>     1656 S. East Bay Blvd. Suite #300
>>     Provo, UT 84606
>>
>

