[torqueusers] Torque free/job-exclusive confusion
Lennart Karlsson
Lennart.Karlsson at nsc.liu.se
Wed Jun 14 05:03:05 MDT 2006
Torque, versions 2.1.1-snap.200605191740 and 2.1.0p0-snap.200603212346,
does something that confuses me and also seems to confuse Maui and Moab,
making their scheduling more difficult than necessary:
It sometimes sets the state "free" on busy nodes.
I see this behaviour on both one-processor nodes and two-processor nodes
(where both processors are in use).
For example, I get this output from 'pbsnodes -a n1':
n1
state = free
np = 1
ntype = cluster
jobs = 0/122033.green
status = opsys=linux,uname=Linux n1 2.4.21-27.0.2.EL-nird1 #1 Wed Feb 9 15:58:46 CET 2005 i686,sessions=4127,nsessions=1,nusers=1,idletime=19344652,totmem=4106380kb,availmem=4017800kb,physmem=2058104kb,ncpus=1,loadave=0.99,netload=3709266365,state=free, jobs=122033.green,rectime=1150279971
If I restart pbs_server, I get the (in my eyes) saner output:
n1
state = job-exclusive
np = 1
ntype = cluster
jobs = 0/122033.green
status = opsys=linux,uname=Linux n1 2.4.21-27.0.2.EL-nird1 #1 Wed Feb 9 15:58:46 CET 2005 i686,sessions=4127,nsessions=1,nusers=1,idletime=19346814,totmem=4106380kb,availmem=4017864kb,physmem=2058104kb,ncpus=1,loadave=0.99,netload=3709431732,state=free,jobs=122033.green,rectime=1150282133
I do not really know when and why the node is declared "free". The job
122033 had run for a few hours before this happened, so it does not seem
to be a lost communication between pbs_mom and pbs_server.
Is there a fix for this problem? Or does anyone have an idea about
how to find where the information mismatch is created? I assume that the
state information is set by the pbs_server? Or is it set by the scheduler
or the pbs_mom?
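
One way to catch the mismatch when it appears might be to scan the
'pbsnodes -a' output for nodes that are reported as "free" although their
jobs list already fills all np slots. The following is only a minimal sketch,
assuming the attribute layout shown in the output above (a node name on its
own line, followed by "key = value" lines). The function names, the node n2
and the job 122034.green are hypothetical and exist only for illustration.

```python
def parse_pbsnodes(text):
    """Parse 'pbsnodes -a' style text into {node_name: {attribute: value}}."""
    nodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if " = " in line and current is not None:
            # Attribute line; split only at the first " = " so that the
            # long status string stays in one piece.
            key, value = line.split(" = ", 1)
            nodes[current][key.strip()] = value.strip()
        elif "=" not in line and " " not in line:
            # Assumption: a bare token without "=" or spaces is a node name.
            current = line
            nodes[current] = {}
    return nodes


def busy_but_free(nodes):
    """Return names of nodes reported 'free' whose job slots are all taken."""
    suspects = []
    for name, attrs in nodes.items():
        jobs = [j for j in attrs.get("jobs", "").split(",") if j.strip()]
        np = int(attrs.get("np", "0"))
        if attrs.get("state") == "free" and np > 0 and len(jobs) >= np:
            suspects.append(name)
    return suspects


# n1 is taken from the output above; n2 is a made-up node for contrast.
SAMPLE = """\
n1
     state = free
     np = 1
     ntype = cluster
     jobs = 0/122033.green

n2
     state = job-exclusive
     np = 1
     ntype = cluster
     jobs = 0/122034.green
"""

print(busy_but_free(parse_pbsnodes(SAMPLE)))
```

Run periodically against the live 'pbsnodes -a' output (for instance from
cron), such a check could at least narrow down when a node flips to "free",
which could then be compared with the server and mom logs for that time.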
-- Lennart Karlsson <Lennart.Karlsson at nsc.liu.se>
National Supercomputer Centre in Linkoping, Sweden
http://www.nsc.liu.se