[torqueusers] Node marked "free" but it has a job running
David McGiven
david.mcgiven at fusemail.com
Thu Apr 13 04:42:20 MDT 2006
Dear TORQUE Users,
This morning I came into the office and saw that my pbs_server was down.
It had crashed some hours earlier (I haven't found anything in the logs
that helps debug the problem ... pity). The jobs were still running
normally on the pbs_mom nodes, though.
I then restarted pbs_server, and after a few minutes "qstat" was producing
the correct output.
However, "pbsnodes -a" lists the nodes as "free" while also showing that
they are running a job (and if you ssh to them, they really are running
the job).
For example:
node17
     state = free
     np = 1
     properties = cluster
     ntype = cluster
     jobs = 0/1195.machine.domain
     status = opsys=linux,uname=Linux node17 2.4.31.050901.nodes #1 Thu Sep 1 12:37:20 CEST 2005 i686,sessions=181 21727,nsessions=2,nusers=2,idletime=6815307,totmem=2857348kb,availmem=2357384kb,physmem=905460kb,ncpus=1,loadave=1.00,netload=3762816549,state=free,jobs=1195.bender.uab.es,rectime=1144924315
That has never happened to me before. Whenever a job ran on a node, the
node was marked job-exclusive, even if its load didn't reach 1.00.
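To find every node in this inconsistent state at once, something like the following might help. It is only a sketch: it assumes the usual "pbsnodes -a" layout, where each node name sits on an unindented line and its attributes follow on indented "key = value" lines. The entries for node18 and node19 in the sample are made up for illustration, and the function names are hypothetical.

```python
def parse_pbsnodes(text):
    """Parse `pbsnodes -a` style text into {node_name: {attribute: value}}."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate node records
        if not line[0].isspace():
            # An unindented line is assumed to start a new node record.
            current = line.strip()
            nodes[current] = {}
        elif current is not None and " = " in line:
            # Split on the first " = " only, so values such as the
            # status string may contain further "=" characters.
            key, value = line.strip().split(" = ", 1)
            nodes[current][key] = value
    return nodes


def free_but_busy(nodes):
    """Return the names of nodes reported as free that still list jobs."""
    return sorted(
        name
        for name, attrs in nodes.items()
        if "free" in attrs.get("state", "").split(",") and attrs.get("jobs")
    )


# node17 mirrors the report above; node18 and node19 are invented examples.
SAMPLE = """\
node17
     state = free
     np = 1
     jobs = 0/1195.machine.domain

node18
     state = job-exclusive
     np = 1
     jobs = 0/1196.machine.domain

node19
     state = free
     np = 1
"""

if __name__ == "__main__":
    print(free_but_busy(parse_pbsnodes(SAMPLE)))
```

On the sample data only node17 is reported, since node18 is correctly job-exclusive and node19 is free with no jobs. To check a live cluster, the text of "pbsnodes -a" could be passed to parse_pbsnodes() in place of SAMPLE.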
What can I do to correct this? I've tried:
Qmgr: set node node17 state=job-exclusive
qmgr obj=node17 svr=default: Operation not permitted
I'm afraid the next person who submits something to the queue will get
their job running on one of the nodes that is already busy.
Also, I saw in the PBS manual that there are "ideal_load" and "max_load"
parameters. Those might work. However, modifying the config file in
/var/spool/PBS/mom_priv requires a restart of pbs_mom, and that means
killing the running job.
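A minimal sketch of what those entries might look like in the pbs_mom config file. The values are illustrative assumptions, not tested settings, and whether they would change how the server marks these nodes is exactly the open question here.

```
# /var/spool/PBS/mom_priv/config (values are placeholders, not recommendations)
$ideal_load 0.8
$max_load 1.0
```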
Thanks in advance.
Best Regards,
David
CTO