[torqueusers] SIGUSR1 results in a SIGTERM

David Beer dbeer at adaptivecomputing.com
Tue Mar 22 09:28:47 MDT 2011


This is a somewhat tricky feature to get working in TORQUE. First, do you have kill_delay set in qmgr? Second, you need to make sure that the shell launched by TORQUE isn't itself killed by the signal you are sending, in this case SIGUSR1. Each job script is fed to a shell on its standard input, so the shell must be told to catch the signal as well as the job. Often this is done by adding some signal-handling code to the startup defaults of whichever shell you're using for your jobs.
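
As a rough illustration of the shell-side handling described above, here is a minimal sketch of a job script that traps SIGUSR1 so the shell itself survives the signal. The handler name on_usr1, the got_usr1 flag, and the sleep command standing in for the real workload are all placeholders, and the script signals itself only so the effect can be seen outside the queue; in a real job the signal would arrive from qsig.

```shell
#!/bin/bash
# Sketch only: keep the job's shell alive when SIGUSR1 arrives.
# Without a trap, the default action of SIGUSR1 terminates the shell.

got_usr1=0
on_usr1() {
    # Hypothetical handler: just record that the signal was seen.
    got_usr1=1
}
trap on_usr1 USR1

# Placeholder for the real workload (for example an mpirun command).
# It runs in the background so the shell can react to signals while waiting.
sleep 1 &
child=$!

# Demonstration only: send SIGUSR1 to this shell.
# In a real job, qsig would deliver the signal instead.
kill -USR1 "${BASHPID:-$$}"

# wait returns early when a trapped signal arrives, so keep waiting
# until the workload has really exited.
while kill -0 "$child" 2>/dev/null; do
    wait "$child"
done

echo "shell survived, got_usr1=$got_usr1"
```

The same trap could instead live in the shell's startup file, so that every job picks it up. The kill_delay value mentioned above is set through qmgr, for example with something like qmgr -c "set queue batch kill_delay = 30", where the queue name batch and the 30-second value are assumptions.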


----- Original Message -----
> Hi all,
> I've been digging through the mailing list and docs, but I'm stumped.
> I'm trying to have my program write data and exit cleanly on receipt
> of SIGUSR1 (or any other signal for that matter).
> The program works as expected when run with mpirun, but when using the
> queue the job is killed with signal 15 right after receiving signal 10
> (or 12). This is true of a small cluster of mine running torque2.4.1
> as well as our university system running moab and I _think_
> torque249.. still trying to figure out how to tell definitively as a
> user.
> From the docs, I gather that the queue manager should pass along
> signals by default, and it appears to be:
> $ qsub -l nodes=2:ppn=2 qsubmit.sh
> 4340.biophotonics1.bp1.loc
> $ qsig -s SIGUSR1 4340.biophotonics1.bp1.loc
> $ cat montecarlo.o4340
> mpirun: Forwarding signal 10 to job
> Caught SIGNAL 10 on proc 0, exiting..
> Caught SIGNAL 10 on proc 0, exiting..
> Caught SIGNAL 10 on proc 2, exiting..
> Caught SIGNAL 10 on proc 3, exiting..
> mpirun: killing job...
> Caught SIGNAL 15 on proc 0, exiting..
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 31855 on node bp1n2 exited
> on signal 0 (Unknown signal 0).
> --------------------------------------------------------------------------
> Caught SIGNAL 15 on proc 2, exiting..
> Caught SIGNAL 15 on proc 1, exiting..
> Caught SIGNAL 10 on proc 1, exiting..
> Caught SIGNAL 15 on proc 3, exiting..
> mpirun: clean termination accomplished
> 4 total processes killed (some possibly by mpirun during cleanup)
> It appears that signal 10 is being forwarded properly and my program
> catches it and begins to exit, but then the server sends a SIGTERM
> which kills everything before my jobs can finish writing their data.
> Any suggestions on how to debug this would be appreciated.
> Thanks,
> --
> Jeremy D. Rogers, Ph.D.
> Postdoctoral Fellow
> Biomedical Engineering
> Northwestern University

David Beer 
Direct Line: 801-717-3386 | Fax: 801-717-3738
     Adaptive Computing
     1656 S. East Bay Blvd. Suite #300
     Provo, UT 84606
