[torqueusers] Torque free/job-exclusive confusion

Lennart Karlsson Lennart.Karlsson at nsc.liu.se
Wed Jun 14 05:03:05 MDT 2006


Torque, versions 2.1.1-snap.200605191740 and 2.1.0p0-snap.200603212346,
does something that confuses me and also seems to confuse Maui and Moab,
making their scheduling more difficult than necessary:
It sometimes sets the state "free" on busy nodes.

I see this behaviour on both one-processor nodes and two-processor nodes
(where both processors are in use).

For example, I get this output from 'pbsnodes -a n1':
n1
     state = free
     np = 1
     ntype = cluster
     jobs = 0/122033.green
     status = opsys=linux,uname=Linux n1 2.4.21-27.0.2.EL-nird1 #1 Wed Feb 9 15:58:46 CET 2005 i686,sessions=4127,nsessions=1,nusers=1,idletime=19344652,totmem=4106380kb,availmem=4017800kb,physmem=2058104kb,ncpus=1,loadave=0.99,netload=3709266365,state=free, jobs=122033.green,rectime=1150279971


If I restart pbs_server, I get the (in my eyes) saner output:
n1
     state = job-exclusive
     np = 1
     ntype = cluster
     jobs = 0/122033.green
     status = opsys=linux,uname=Linux n1 2.4.21-27.0.2.EL-nird1 #1 Wed Feb 9 15:58:46 CET 2005 i686,sessions=4127,nsessions=1,nusers=1,idletime=19346814,totmem=4106380kb,availmem=4017864kb,physmem=2058104kb,ncpus=1,loadave=0.99,netload=3709431732,state=free,jobs=122033.green,rectime=1150282133


I do not really know when and why the node is declared "free". The job
122033 had run for a few hours before this happened, so it does not seem
to be a lost communication between pbs_mom and pbs_server.

Is there a fix for this problem? Or does any of you have a good idea about
how to find where the information mismatch is created? I assume that the
state information is set by the pbs_server? Or is it set by the scheduler
or the pbs_mom?
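
To narrow down when the mismatch appears, one could periodically scan the
'pbsnodes -a' output for nodes that are reported "free" although every
processor slot already has a job. The following is only a minimal sketch in
Python, not part of Torque: the function names parse_pbsnodes and
mismatched_nodes are made up for illustration, and it assumes that each
attribute sits on a single unwrapped line and that the jobs attribute uses
the slot/jobid form shown above.

```python
def parse_pbsnodes(text):
    """Parse 'pbsnodes -a' style text into {node_name: {attribute: value}}."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new node record.
            current = line.strip()
            nodes[current] = {}
        elif current is not None and " = " in line:
            # Indented "key = value" lines belong to the current node.
            key, value = line.strip().split(" = ", 1)
            nodes[current][key] = value
    return nodes


def mismatched_nodes(nodes):
    """Return nodes reported 'free' although all np slots hold a job."""
    suspects = []
    for name, attrs in nodes.items():
        states = attrs.get("state", "").split(",")
        np = int(attrs.get("np", "0"))
        jobs = [j.strip() for j in attrs.get("jobs", "").split(",") if j.strip()]
        # Assumed format: each entry is "<slot>/<jobid>", e.g. "0/122033.green".
        slots = {j.split("/", 1)[0] for j in jobs}
        if "free" in states and np > 0 and len(slots) >= np:
            suspects.append(name)
    return suspects


# Sample taken from the n1 output above (status line omitted).
SAMPLE = """n1
     state = free
     np = 1
     ntype = cluster
     jobs = 0/122033.green
"""

if __name__ == "__main__":
    print(mismatched_nodes(parse_pbsnodes(SAMPLE)))  # prints ['n1']
```

Run from cron against the live 'pbsnodes -a' output, with a timestamp
logged for each hit, this might show whether the state flips at some
particular event, such as a status update from pbs_mom.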

-- Lennart Karlsson <Lennart.Karlsson at nsc.liu.se>
   National Supercomputer Centre in Linkoping, Sweden
   http://www.nsc.liu.se



