[torqueusers] Node marked "free" but it has a job running
garrick at usc.edu
Thu Apr 13 04:50:43 MDT 2006
On Thu, Apr 13, 2006 at 05:42:20AM -0500, David McGiven alleged:
> Dear TORQUE Users,
> This morning I came to the office and saw that my pbs_server was down.
> It had crashed some hours earlier (I haven't found anything in the logs
> to debug the problem ... pity). The jobs were still running normally on
> the pbs_mom nodes though.
> I then re-ran pbs_server and after a few minutes "qstat" was producing the
> correct output.
> However, when doing "pbsnodes -a" it lists the nodes marked as "free" but
> at the same time it says they are running a job (And if you ssh to them,
> they're really running the job).
> For example:
>      state = free
>      np = 1
>      properties = cluster
>      ntype = cluster
>      jobs = 0/1195.machine.domain
>      status = opsys=linux,uname=Linux node17 2.4.31.050901.nodes #1 Thu
> Sep 1 12:37:20 CEST 2005 i686,sessions=181
> That has never happened to me before. Whenever a job ran on a node, the
> node was marked job-exclusive, even if its load didn't reach 1.00.
> What can I do to correct this? I've tried:
> Qmgr: set node node17 state=job-exclusive
> qmgr obj=node17 svr=default: Operation not permitted
> I'm afraid the next person who submits something to the queue will get
> their job running on one of the nodes that is already busy.
Don't worry about it. It's not an issue. "free" just means the node doesn't
have another state, i.e. "busy" or "job-exclusive".
And job-exclusive doesn't mean anything. The scheduler doesn't look at
IIRC, newer TORQUE correctly preserves job-exclusive (despite it being
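
Since "free" only says that no other state is set, the jobs attribute is the more useful thing to look at when you want to know whether a node is occupied. A minimal Python sketch, assuming the usual "pbsnodes -a" layout (node name at column 0, indented "key = value" lines, a blank line between nodes). The function names and the node18 entry in the sample are made up for illustration:

```python
def parse_pbsnodes(text):
    """Parse `pbsnodes -a` style text into {node_name: {attribute: value}}."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None          # a blank line ends the current node block
            continue
        if not line[0].isspace():
            current = line.strip()  # an unindented line starts a new node
            nodes[current] = {}
        elif current is not None and "=" in line:
            # Split on the first "=" only, so values such as
            # "opsys=linux,..." stay intact.
            key, _, value = line.partition("=")
            nodes[current][key.strip()] = value.strip()
    return nodes


def busy_nodes(nodes):
    """Return {node_name: [job entries]} for nodes listing at least one job,
    regardless of what their state attribute says."""
    busy = {}
    for name, attrs in nodes.items():
        jobs = [j.strip() for j in attrs.get("jobs", "").split(",") if j.strip()]
        if jobs:
            busy[name] = jobs
    return busy


# Sample input: node17 follows the output quoted above; node18 is hypothetical.
SAMPLE = """node17
     state = free
     np = 1
     jobs = 0/1195.machine.domain
     status = opsys=linux,sessions=181

node18
     state = free
     np = 1
"""

if __name__ == "__main__":
    for name, jobs in busy_nodes(parse_pbsnodes(SAMPLE)).items():
        print(name, jobs)
```

In practice the text would come from running "pbsnodes -a" rather than from a hard-coded sample.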
> Also, I saw in the PBS manual that there are "ideal_load" and
> "max_load" parameters. That might work. However, modifying the config file
> in /var/spool/PBS/mom_priv requires a restart of pbs_mom, and that means
> I'll kill the running job.
momctl can send a new config file to pbs_mom, or use SIGHUP to cause
pbs_mom to reread it.
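
Either route could look roughly like the sketch below. Assumptions to check locally: that pbs_mom keeps its PID in a mom.lock file under mom_priv, and that your momctl accepts "-h <host>" and "-r <file>" (see the momctl man page for your version). The function names are hypothetical, and both reload functions need to run with sufficient privileges on the right host:

```python
import os
import signal
import subprocess


def read_pid(lock_path):
    """Read the first whitespace-separated token of a lock file as a PID."""
    with open(lock_path) as fh:
        return int(fh.read().split()[0])


def reload_by_sighup(lock_path="/var/spool/PBS/mom_priv/mom.lock"):
    """Send SIGHUP to pbs_mom so it rereads its config without restarting.

    The default lock path is an assumption based on the mom_priv directory
    mentioned above; adjust it for your installation.
    """
    pid = read_pid(lock_path)
    os.kill(pid, signal.SIGHUP)
    return pid


def momctl_reconfig_argv(host, config_path):
    """Build a momctl command line that pushes a config file to one MOM.

    The -h and -r flags are assumed from the momctl man page; verify them
    against the version you have installed.
    """
    return ["momctl", "-h", host, "-r", config_path]


def reload_by_momctl(host, config_path):
    """Run momctl and raise CalledProcessError if it exits non-zero."""
    subprocess.run(momctl_reconfig_argv(host, config_path), check=True)
```

Neither path restarts pbs_mom, which is the point: the running job is left alone.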
Garrick Staples, Linux/HPCC Administrator
University of Southern California