[torqueusers] Node marked "free" but it has a job running

David McGiven david.mcgiven at fusemail.com
Thu Apr 13 04:42:20 MDT 2006


Dear TORQUE Users,

This morning I came to the office and saw that my pbs_server was down.
It had crashed some hours earlier (I haven't found anything in the logs
to help debug the problem ... a pity). The jobs were still running
normally on the pbs_mom nodes, though.

I then restarted pbs_server, and after a few minutes "qstat" was producing
the correct output again.
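For the record, I simply started the daemon again by hand, something like
this (the path below is just where my install happens to put it, and I
didn't pass any particular restart option):

  /usr/local/sbin/pbs_server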

However, "pbsnodes -a" lists the nodes as "free" while at the same time
showing that they are running a job (and if you ssh to them, they really
are running the job).

For example:

node17
     state = free
     np = 1
     properties = cluster
     ntype = cluster
     jobs = 0/1195.machine.domain
     status = opsys=linux,uname=Linux node17 2.4.31.050901.nodes #1 Thu
Sep 1 12:37:20 CEST 2005 i686,sessions=181
21727,nsessions=2,nusers=2,idletime=6815307,totmem=2857348kb,availmem=2357384kb,physmem=905460kb,ncpus=1,loadave=1.00,netload=3762816549,state=free,jobs=1195.bender.uab.es,rectime=1144924315

That has never happened to me before. Whenever a job ran on a node, the
node was marked job-exclusive, even if it didn't reach a load of 1.00.

What can I do to correct this? I've tried:

Qmgr: set node node17 state=job-exclusive
qmgr obj=node17 svr=default: Operation not permitted

I'm afraid the next person who submits something to the queue will get
their job running on one of the nodes that is already busy calculating.
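As a stopgap, I was considering marking the busy node offline so the
scheduler skips it, and then clearing the flag once the job finishes --
assuming I have the pbsnodes syntax right:

  # mark node17 offline so no new jobs land on it
  pbsnodes -o node17

  # later, when the running job is done, clear the offline flag
  pbsnodes -c node17

But I don't know whether that is the right approach here.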

Also, I saw in the PBS manual that there are "ideal_load" and "max_load"
parameters. Those might work. However, modifying the config file in
/var/spool/PBS/mom_priv requires restarting pbs_mom, and that means I'll
kill the running job.
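For reference, the lines I had in mind for mom_priv/config would be
something like the following (the actual values are just a guess on my
part):

  # /var/spool/PBS/mom_priv/config
  $ideal_load 0.9
  $max_load   1.1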

Thanks in advance.

Best Regards,
David
CTO
