[torqueusers] pbs scheduler crashing

Jack Fowler fowler at hep.brown.edu
Thu Feb 18 13:48:54 MST 2010


Hello All,
Forgive me if this issue is already in the archive, but I've found no way
to search it. We're running torque/pbs 2.3.6 on our cluster of 25 nodes. In
response to a user problem, I've been running a simple test script that
crashes pbs_sched, bringing everything to a halt (as does the user's
job submission script). In my script I can vary the number of dummy
jobs submitted. Below 160 or so, all jobs run to completion just fine.
Between about 160 and 200 the crashes are intermittent, and above 200
the scheduler crashes consistently. When it crashes,
/var/spool/torque/server_logs/logfile always contains output similar to
the excerpt shown below.

Any help is appreciated.
Thanks,
Jack

First my test script:

# Submit 201 dummy jobs (I = 0..200); each job just sleeps for 10 seconds.
I=0
until [ "$I" -eq 201 ] ; do
    echo "sleep 10" | qsub
    I=$((I+1))
done
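
Since the failure depends on how many jobs are submitted, a parameterized
variant of the loop makes it easier to step through the 160 to 200 range.
This is only a sketch: the function name submit_dummy_jobs and the QSUB
variable are made up for illustration, and QSUB simply defaults to the
real qsub command.

```shell
# Hypothetical helper: submit COUNT dummy "sleep 10" jobs.
# QSUB is an assumed override hook so the loop can be dry-run with a
# stand-in command; when unset, the real qsub is used.
submit_dummy_jobs() {
    local count="$1"
    local qsub_cmd="${QSUB:-qsub}"
    local i=0
    while [ "$i" -lt "$count" ]; do
        # Stop at the first failed submission instead of piling on more.
        echo "sleep 10" | "$qsub_cmd" || return 1
        i=$((i + 1))
    done
}
```

Calling "submit_dummy_jobs 201" submits the same number of jobs as the
loop above; smaller counts can be tried the same way.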

And the log file excerpt:

02/18/2010 08:09:05;0008;PBS_Server;Job;182392.brux.hep.brown.edu;Job Modified at request of Scheduler at brux.hep.brown.edu
02/18/2010 08:09:05;0008;PBS_Server;Job;182392.brux.hep.brown.edu;Job Run at request of Scheduler at brux.hep.brown.edu
02/18/2010 08:09:05;0008;PBS_Server;Job;182393.brux.hep.brown.edu;Job Modified at request of Scheduler at brux.hep.brown.edu
02/18/2010 08:09:05;0008;PBS_Server;Job;182393.brux.hep.brown.edu;could not locate requested resources '1#shared' (node_spec failed) cannot allocate node 'master.hep.lo' to job - node not currently available (nps needed/free: 1/-1,  joblist: 182324.brux.hep.brown.edu:0,182323.brux.hep.brown.edu:0,182322.brux.hep.brown.edu:0,182321.brux.hep.brown.edu:0,182320.brux.hep.brown.edu:0)
02/18/2010 08:09:05;0080;PBS_Server;Req;req_reject;Reject reply code=15044(Resource temporarily unavailable REJHOST=master.hep.lo MSG=cannot allocate node 'master.hep.lo' to job - node not currently available (nps needed/free: 1/-1,  joblist: 182324.brux.hep.brown.edu:0,182323.brux.hep.brown.edu:0,182322.brux.hep.brown.edu:0,182321.brux.hep.brown.edu:0,182320.brux.hep.brown.edu:0)), aux=0, type=RunJob, from Scheduler at brux.hep.brown.edu
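
To see which jobs hit the allocation failure around a crash, the server
log can be filtered for the "node_spec failed" entries. A minimal sketch,
assuming each log entry is a single semicolon-separated line with the job
ID in the fifth field, as in the excerpt above; the function name
failed_alloc_jobs is made up for illustration.

```shell
# Hypothetical helper: read a server log on stdin and print the job ID
# (fifth ';'-separated field) of every entry containing "node_spec failed".
failed_alloc_jobs() {
    awk -F';' '/node_spec failed/ { print $5 }'
}
```

It could be run as
"failed_alloc_jobs < /var/spool/torque/server_logs/logfile".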


