[torqueusers] Getting Torque 2.x/Maui 3.x to work on CentOS 6?

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Tue Nov 20 05:45:41 MST 2012


We're upgrading our cluster to CentOS 6.3 and would like to run recent 
versions of Torque 2.x and Maui 3.x.  Unfortunately, with Torque 2.5.12 
and Maui (both 3.3.1 and 3.2.6p21 tested), Maui often tries to start 
jobs on nodes that are already fully occupied by running jobs.  This 
looks like the same problem as described in this July 2012 posting 
(with no solution posted):
http://www.supercluster.org/pipermail/torqueusers/2012-July/014848.html

QUESTION:
---------
Can anyone recommend versions of Torque 2.x and Maui 3.x which have been 
demonstrated to work correctly on CentOS 6.x?

Our older cluster installation runs Torque 2.3.7 and Maui 3.2.6p21 on 
CentOS 5.3 and works like a charm.  Perhaps we should go back to Torque 
2.3 on CentOS 6 as well?

Further details:
----------------

For example, Maui tries to schedule job 1891 on node g042, which is 
already full running job 1894.  The checkjob output below reports that 
node g042 has a joblist containing job 1890, which has already completed 
and is no longer known to qstat.  Many other free nodes are available 
for scheduling jobs.

# checkjob 1891
...
Holds:    Defer
Messages:  cannot start job - RM failure, rc: 15046, msg: 'Resource 
temporarily unavailable REJHOST=g042 MSG=cannot allocate node 'g042' to 
job - node not currently available (nps needed/free: 16/0, gpus 
needed/free: 0/0, joblist: 
1890.audhumbla.fysik.dtu.dk:0,1890.audhumbla.fysik.dtu.dk:1,1890.audhumbla.fysik.dtu.dk:2,1890.audhumbla.fysik.dtu.dk:3,1890.audhumbla.fysik.dtu.dk:4,1890.audhumbla.fysik.dtu.dk:5,1890.audhumbla.fysik.dtu.dk:6,1'

# qstat -f1 1890
qstat: Unknown Job Id 1890.audhumbla.fysik.dtu.dk

# pbsnodes -a g042
g042
      state = job-exclusive
      np = 16
      properties = xeon2670,hp5412g,infiniband,xeon-e5
      ntype = cluster
      jobs = 0/1894.audhumbla.fysik.dtu.dk, 
1/1894.audhumbla.fysik.dtu.dk, 2/1894.audhumbla.fysik.dtu.dk, 
3/1894.audhumbla.fysik.dtu.dk, 4/1894.audhumbla.fysik.dtu.dk, 
5/1894.audhumbla.fysik.dtu.dk, 6/1894.audhumbla.fysik.dtu.dk, 
7/1894.audhumbla.fysik.dtu.dk, 8/1894.audhumbla.fysik.dtu.dk, 
9/1894.audhumbla.fysik.dtu.dk, 10/1894.audhumbla.fysik.dtu.dk, 
11/1894.audhumbla.fysik.dtu.dk, 12/1894.audhumbla.fysik.dtu.dk, 
13/1894.audhumbla.fysik.dtu.dk, 14/1894.audhumbla.fysik.dtu.dk, 
15/1894.audhumbla.fysik.dtu.dk
      status = 
rectime=1353414789,varattr=,jobs=1894.audhumbla.fysik.dtu.dk,state=free,size=32653284kb:32834420kb,netload=667272390,gres=,loadave=16.47,ncpus=16,physmem=65932324kb,availmem=163550852kb,totmem=168332316kb,idletime=1632363,nusers=1,nsessions=1,sessions=36896,uname=Linux 
g042.dcsc.fysik.dtu.dk 2.6.32-279.11.1.el6.x86_64 #1 SMP Tue Oct 16 
15:57:10 UTC 2012 x86_64,opsys=linux,arch=x86_64
      gpus = 0


Apparently, the job/node data structures in Torque and Maui are out of 
sync, to the extent that the batch system is almost useless.
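
The mismatch can also be checked mechanically.  What follows is a minimal 
sketch in Python, not a tested tool.  It assumes the output formats shown 
above: "slot/jobid" pairs in the pbsnodes "jobs =" attribute and 
"jobid:slot" pairs after "joblist:" in the checkjob message.  The function 
names and the shortened sample strings are hypothetical and only serve as 
illustration.

```python
import re

# Matches a Torque job ID such as "1894.audhumbla.fysik.dtu.dk".
JOB_ID = r"\d+\.[A-Za-z0-9.-]+"


def jobs_on_node(pbsnodes_text):
    """Return job IDs from the 'jobs =' attribute (slot/jobid pairs)."""
    jobs = set()
    for line in pbsnodes_text.splitlines():
        line = line.strip()
        if line.startswith("jobs ="):
            jobs.update(re.findall(r"\d+/(" + JOB_ID + r")", line))
    return jobs


def jobs_in_reject_message(message):
    """Return job IDs listed after 'joblist:' (jobid:slot pairs)."""
    _, _, joblist = message.partition("joblist:")
    return set(re.findall(r"(" + JOB_ID + r"):\d+", joblist))


# Shortened samples in the format of the output quoted above (assumption:
# real pbsnodes output keeps the 'jobs =' attribute on a single line).
PBSNODES_SAMPLE = """g042
     state = job-exclusive
     np = 16
     jobs = 0/1894.audhumbla.fysik.dtu.dk, 1/1894.audhumbla.fysik.dtu.dk
"""

REJECT_SAMPLE = (
    "cannot allocate node 'g042' to job - node not currently available "
    "(nps needed/free: 16/0, gpus needed/free: 0/0, joblist: "
    "1890.audhumbla.fysik.dtu.dk:0,1890.audhumbla.fysik.dtu.dk:1"
)

# Jobs named in the rejection message that pbsnodes does not list on the node.
stale = jobs_in_reject_message(REJECT_SAMPLE) - jobs_on_node(PBSNODES_SAMPLE)
print(sorted(stale))
```

With these samples the script reports job 1890 as stale, which matches the 
observation that the rejection message still refers to a job that pbsnodes 
and qstat no longer know about.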

-- 
Ole Holm Nielsen
Department of Physics, Technical University of Denmark

