[torqueusers] Node marked "free" but it has a job running

Garrick Staples garrick at usc.edu
Thu Apr 13 04:50:43 MDT 2006


On Thu, Apr 13, 2006 at 05:42:20AM -0500, David McGiven alleged:
> 
> Dear TORQUE Users,
> 
> This morning I came to the office and saw that my pbs_server was down.
> It had crashed some hours earlier (I haven't found anything in the logs
> to debug the problem ... pity). The jobs were still running normally on
> the pbs_mom nodes though.
> 
> I then re-ran pbs_server and after a few minutes "qstat" was producing the
> correct output.
> 
> However, when doing "pbsnodes -a" it lists the nodes marked as "free" but
> at the same time it says they are running a job (And if you ssh to them,
> they're really running the job).
> 
> For example :
> 
> node17
>      state = free
>      np = 1
>      properties = cluster
>      ntype = cluster
>      jobs = 0/1195.machine.domain
>      status = opsys=linux,uname=Linux node17 2.4.31.050901.nodes #1 Thu Sep 1 12:37:20 CEST 2005 i686,sessions=181 21727,nsessions=2,nusers=2,idletime=6815307,totmem=2857348kb,availmem=2357384kb,physmem=905460kb,ncpus=1,loadave=1.00,netload=3762816549,state=free,jobs=1195.bender.uab.es,rectime=1144924315
> 
> This has never happened to me before. Whenever a job ran on a node, the
> node was marked job-exclusive, even if it didn't reach a load of 1.00.
> 
> What can I do to correct this? I've tried:
> 
> Qmgr: set node node17 state=job-exclusive
> qmgr obj=node17 svr=default: Operation not permitted
> 
> I'm afraid the next person who submits something to the queue will get
> their job running on one of the nodes that is already busy.

Don't worry about it.  It's not an issue.  "free" means the node doesn't
have another state, such as "busy" or "job-exclusive."

And job-exclusive doesn't mean anything.  The scheduler doesn't look at
that.

IIRC, newer versions of TORQUE correctly preserve job-exclusive (despite
it being meaningless.)
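
To see at a glance which nodes report "free" while still listing jobs, the
"pbsnodes -a" output can be parsed. The following is only a minimal sketch,
assuming the layout shown in the example above (an unindented node name
followed by indented "key = value" lines). The function names and the
node18 entry in the sample are hypothetical and used only for illustration.

```python
# Sketch: find nodes whose state is "free" but which still list jobs.
# Assumes the "pbsnodes -a" layout quoted above; other versions may differ.

SAMPLE = """\
node17
     state = free
     np = 1
     jobs = 0/1195.machine.domain

node18
     state = free
     np = 1
"""  # node18 is a made-up idle node for comparison


def parse_pbsnodes(text):
    """Return {node_name: {attribute: value}} from pbsnodes -a style text."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank line between node records
        if not line[0].isspace():
            # Unindented line: start of a new node record.
            current = line.strip()
            nodes[current] = {}
        elif current is not None and " = " in line:
            # Indented "key = value" line; split only on the first " = ".
            key, value = line.strip().split(" = ", 1)
            nodes[current][key] = value
    return nodes


def free_nodes_with_jobs(nodes):
    """Names of nodes marked exactly "free" that still report a jobs entry."""
    return sorted(
        name
        for name, attrs in nodes.items()
        if attrs.get("state") == "free" and attrs.get("jobs")
    )


if __name__ == "__main__":
    print(free_nodes_with_jobs(parse_pbsnodes(SAMPLE)))
```

On a live system the text would come from running "pbsnodes -a" instead of
the SAMPLE string.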


> Also, I saw in the PBS manual that there are "ideal_load" and "max_load"
> parameters. That might work. However, modifying the config file in
> /var/spool/PBS/mom_priv requires a restart of pbs_mom, and that means
> I'll kill the running job.

momctl can send a new config file to pbs_mom, or use SIGHUP to cause
pbs_mom to reread it.
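
The SIGHUP route could be scripted roughly as below. This is a sketch under
assumptions, not a tested procedure: it assumes pbs_mom records its PID in a
mom.lock file inside the mom_priv directory mentioned above, and the function
names are hypothetical. Check the lock file location on your own install
before relying on it.

```python
# Sketch: ask a running pbs_mom to reread its config by sending SIGHUP.
import os
import signal

# Assumed location of the PID file; the thread only names the mom_priv
# directory, so the "mom.lock" file name is an assumption.
MOM_LOCK = "/var/spool/PBS/mom_priv/mom.lock"


def read_mom_pid(lock_path=MOM_LOCK):
    """Read the PID stored as the first token of the lock file."""
    with open(lock_path) as fh:
        return int(fh.read().split()[0])


def reload_mom_config(lock_path=MOM_LOCK, kill=os.kill):
    """Send SIGHUP to the PID found in lock_path and return that PID.

    The kill argument is injectable so the function can be dry-run
    without signalling a real process.
    """
    pid = read_mom_pid(lock_path)
    kill(pid, signal.SIGHUP)
    return pid


if __name__ == "__main__":
    # Normally needs to run as root on the compute node.
    print("sent SIGHUP to pbs_mom, pid", reload_mom_config())
```

Because the signal only asks the daemon to reread its configuration, this
avoids the restart the question was worried about.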

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California