[torqueusers] taking node offline w/o killing running job

Garrick Staples garrick at usc.edu
Sat Jan 7 22:45:49 MST 2006


On Sat, Jan 07, 2006 at 11:42:51AM -0500, Andrew J Caird alleged:
> On Sat, 7 Jan 2006, Paul Raines wrote:
> 
> >
> >How does one take a node offline w/o killing running jobs?  I
> >want to make sure no new jobs get started on the node but
> >want to let the running jobs finish.  Doing 'pbsnodes -o' kills
> >any jobs on the nodes.  Once no jobs are on it, I want to apply
> >some updates needing a reboot and then I will put it back online.
> 
> We've used "qmgr -c 's n nodename state=offline'" in the past with 
> success.  But you can't set it back to 'busy' once you do.

Newer versions of TORQUE are much better about this.  The node state
handling was completely rewritten (I think it was 1.2.0p5).  You no
longer have any control over "busy" and "down".  Even if you change it
in qmgr, those two will always correct themselves.

Manually setting the node's state to offline in qmgr will overwrite the
existing states, but "busy" will come back again within a minute or so.
"job-exclusive" is lost and won't come back.

Using 'pbsnodes -o' (mark offline) and 'pbsnodes -c' (clear offline)
won't interfere with the other node states on newer versions of TORQUE:

$ pbsnodes -a hpcjr0006 | grep "state "
     state = job-exclusive,busy
$ pbsnodes -o hpcjr0006
$ pbsnodes -a hpcjr0006 | grep "state "
     state = offline,job-exclusive,busy
$ pbsnodes -c hpcjr0006
$ pbsnodes -a hpcjr0006 | grep "state "
     state = job-exclusive,busy
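
If you want to check this from a script rather than by eye, something
like the following might do. It is only a sketch: node_state and
node_has_state are names made up for this example, not part of TORQUE,
and it assumes the "state = ..." line looks the way it does in the
output above.

```shell
# Print the comma-separated state list for one node,
# e.g. "offline,job-exclusive,busy".
# Assumes pbsnodes prints a line of the form "     state = a,b,c".
node_state() {
    pbsnodes -a "$1" | sed -n 's/^ *state = //p'
}

# Succeed if the given state appears in the node's state list.
# Commas are added on both sides so "busy" cannot match a longer name.
node_has_state() {
    case ",$(node_state "$1")," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, "node_has_state hpcjr0006 offline && echo drained-or-draining"
would print only while the node is marked offline.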

(comment lines trimmed for clarity)
Qmgr: p n hpcjr0006 state
set node hpcjr0006 state = job-exclusive
set node hpcjr0006 state += busy

Qmgr: s n hpcjr0006 state=offline
Qmgr: p n hpcjr0006 state
set node hpcjr0006 state = offline
(wait a minute)

Qmgr: p n hpcjr0006 state
set node hpcjr0006 state = offline
set node hpcjr0006 state += busy

And no, the job running on that node is never killed by marking it
offline.
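
For the original question (mark the node offline, let the running jobs
finish, then reboot), a rough sketch of the waiting part could look like
this. The function names and the DRAIN_POLL_SECONDS variable are
invented for the example. It also assumes that pbsnodes shows a
"jobs = ..." line for a node only while jobs are assigned to it; check
that against the output of your own version before relying on it.

```shell
# Succeed if pbsnodes lists at least one job on the node.
# Assumption: assigned jobs appear on a "     jobs = ..." line.
node_has_jobs() {
    pbsnodes -a "$1" | grep -q '^ *jobs = '
}

# Mark a node offline, then block until no jobs are listed on it.
# DRAIN_POLL_SECONDS is a hypothetical knob for this sketch (default 60).
drain_node() {
    pbsnodes -o "$1" || return 1
    while node_has_jobs "$1"; do
        sleep "${DRAIN_POLL_SECONDS:-60}"
    done
    echo "$1 is offline and idle"
}
```

After "drain_node hpcjr0006" returns, the node can be updated and
rebooted, and 'pbsnodes -c hpcjr0006' puts it back in service.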



-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California