[torqueusers] pbs_sched and torque-2.3.7
c.tinker at niwa.co.nz
Tue Oct 20 20:07:21 MDT 2009
I have been trying to run Torque-2.3.7 with pbs_sched on a single server (4 dual-core CPUs) running Red Hat Enterprise Linux Server release 5.3, and the scheduler keeps crashing with:
*** buffer overflow detected ***: /usr/local/sbin/pbs_sched terminated
Compiling and running it with gdb, I get:
*** stack smashing detected ***: /usr/local/sbin/pbs_sched terminated
Program received signal SIGABRT, Aborted.
0x0000003d1e230215 in raise () from /lib64/libc.so.6
#0 0x0000003d1e230215 in raise () from /lib64/libc.so.6
#1 0x0000003d1e231cc0 in abort () from /lib64/libc.so.6
#2 0x0000003d1e26a7fb in __libc_message () from /lib64/libc.so.6
#3 0x0000003d1e2e824f in __stack_chk_fail () from /lib64/libc.so.6
#4 0x0000000000405c9b in run_update_job (pbs_sd=0, sinfo=0xe7c8060, qinfo=0xe7c96f0, jinfo=0xe7cddf0) at fifo.c:696
#5 0x0000000000405500 in scheduling_cycle (sd=Cannot access memory at address 0x392c303a7a6e2d6b
) at fifo.c:463
Cannot access memory at address 0x392c303a7a6e2e77
After a bit more research I have found that this occurs if the queues are set up to allow more jobs to run than the value specified by np= in the ....server_priv/nodes configuration file.
My programming skills let me down at this point, and the software is too complex for me to get any further at the moment, but the problem seems to occur when more jobs are eligible to run than the np value allows.
pbs_sched starts job np+1, then crashes with a buffer overflow. I suspect that at this point the scheduler has requested, or is waiting for, data from the mom or the server process and, because the number of running jobs exceeds np, receives an invalid value that causes the overflow.
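For readers unfamiliar with the gdb output above: "stack smashing detected" is what glibc prints when a function has written past the end of a local buffer and overwritten the compiler's stack guard. The two addresses gdb could not read (0x392c303a7a6e2d6b and 0x392c303a7a6e2e77) consist entirely of printable ASCII bytes, which is consistent with string data having overwritten the stack frame. The sketch below is not taken from the Torque source, and what fifo.c:696 actually does has not been checked; append_node and its parameters are hypothetical names. It only illustrates, assuming the overflow comes from building a string in a fixed-size buffer, how a length check turns that kind of overflow into an error return:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, NOT from the Torque source: append a node name
 * to a comma-separated list held in a fixed-size, NUL-terminated buffer.
 * Returns 0 on success, or -1 if the name would not fit, in which case
 * the buffer is left unchanged. */
int append_node(char *buf, size_t bufsize, const char *name)
{
    size_t used = strlen(buf);
    size_t len  = strlen(name);
    size_t need = len + (used > 0 ? 1 : 0);   /* +1 for the separating comma */

    if (used + need + 1 > bufsize)            /* +1 for the terminating NUL */
        return -1;

    if (used > 0)
        buf[used++] = ',';
    memcpy(buf + used, name, len + 1);        /* len + 1 copies the NUL too */
    return 0;
}
```

With an 8-byte buffer, appending "n01" and then "n02" succeeds and leaves "n01,n02"; a third name is rejected with -1 instead of being written past the end of the buffer.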
I haven't found anything in the manuals that suggests the queues must be configured so that no more jobs are eligible to run than np allows, but it seems to be necessary to prevent the above.
This looks similar to:
"[torqueusers] Stack corruption in pbs_sched 2.1.8?" reported in July 2007. (http://www.clusterresources.com/pipermail/torqueusers/2007-July/005840.html)
I know that Maui is the recommended scheduler, but the use of pbs_sched is imposed on me by other constraints.
Has anyone else seen this problem, or does anyone have Torque working with pbs_sched in a similar environment?
NIWA is the trading name of the National Institute of Water & Atmospheric Research Ltd.