[torqueusers] Getting Torque 2.x/Maui 3.x to work on CentOS 6?
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Tue Nov 20 05:45:41 MST 2012
We're upgrading our cluster to CentOS 6.3 and would like to run recent
versions of Torque 2.x and Maui 3.x. Unfortunately, with Torque 2.5.12
and Maui (both 3.3.1 and 3.2.6p21 tested), Maui often tries to start
jobs on nodes that are already fully occupied by running jobs. This
appears to be the same problem described in this July 2012 posting
(which received no solution):
http://www.supercluster.org/pipermail/torqueusers/2012-July/014848.html
QUESTION:
---------
Can anyone recommend versions of Torque 2.x and Maui 3.x which have been
demonstrated to work correctly on CentOS 6.x?
Our older cluster installation runs Torque 2.3.7 and Maui 3.2.6p21 on
CentOS 5.3 and works like a charm. Perhaps we should go back to Torque
2.3 on CentOS 6 as well?
Further details:
----------------
For example, we're seeing this strange behavior where Maui wants to
schedule job 1891 to run on node g042 (which is already full running job
1894). The checkjob command claims that node g042 has a joblist
containing job 1890, which has already completed. There are many other
free nodes available for scheduling jobs.
# checkjob 1891
...
Holds: Defer
Messages: cannot start job - RM failure, rc: 15046, msg: 'Resource
temporarily unavailable REJHOST=g042 MSG=cannot allocate node 'g042' to
job - node not currently available (nps needed/free: 16/0, gpus
needed/free: 0/0, joblist:
1890.audhumbla.fysik.dtu.dk:0,1890.audhumbla.fysik.dtu.dk:1,1890.audhumbla.fysik.dtu.dk:2,1890.audhumbla.fysik.dtu.dk:3,1890.audhumbla.fysik.dtu.dk:4,1890.audhumbla.fysik.dtu.dk:5,1890.audhumbla.fysik.dtu.dk:6,1'
# qstat -f1 1890
qstat: Unknown Job Id 1890.audhumbla.fysik.dtu.dk
# pbsnodes -a g042
g042
state = job-exclusive
np = 16
properties = xeon2670,hp5412g,infiniband,xeon-e5
ntype = cluster
jobs = 0/1894.audhumbla.fysik.dtu.dk,
1/1894.audhumbla.fysik.dtu.dk, 2/1894.audhumbla.fysik.dtu.dk,
3/1894.audhumbla.fysik.dtu.dk, 4/1894.audhumbla.fysik.dtu.dk,
5/1894.audhumbla.fysik.dtu.dk, 6/1894.audhumbla.fysik.dtu.dk,
7/1894.audhumbla.fysik.dtu.dk, 8/1894.audhumbla.fysik.dtu.dk,
9/1894.audhumbla.fysik.dtu.dk, 10/1894.audhumbla.fysik.dtu.dk,
11/1894.audhumbla.fysik.dtu.dk, 12/1894.audhumbla.fysik.dtu.dk,
13/1894.audhumbla.fysik.dtu.dk, 14/1894.audhumbla.fysik.dtu.dk,
15/1894.audhumbla.fysik.dtu.dk
status =
rectime=1353414789,varattr=,jobs=1894.audhumbla.fysik.dtu.dk,state=free,size=32653284kb:32834420kb,netload=667272390,gres=,loadave=16.47,ncpus=16,physmem=65932324kb,availmem=163550852kb,totmem=168332316kb,idletime=1632363,nusers=1,nsessions=1,sessions=36896,uname=Linux
g042.dcsc.fysik.dtu.dk 2.6.32-279.11.1.el6.x86_64 #1 SMP Tue Oct 16
15:57:10 UTC 2012 x86_64,opsys=linux,arch=x86_64
gpus = 0
The job/node data structures in Torque and Maui appear to be so far out
of sync that the batch system is almost useless.
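
To make the mismatch explicit, the job IDs in the "joblist:" part of the
checkjob message can be compared with the job IDs that pbsnodes reports
for the same node. The following is only a minimal Python sketch: the
function names are hypothetical, and the two text formats (slot/jobid
for pbsnodes, jobid:slot for the joblist) are assumed from the output
shown above, not taken from any Torque or Maui documentation.

```python
import re


def jobs_from_pbsnodes(jobs_field):
    """Job IDs in a pbsnodes 'jobs = ...' value (assumed 'slot/jobid' entries)."""
    ids = set()
    for entry in jobs_field.split(","):
        _slot, _, jobid = entry.strip().partition("/")
        if jobid:
            ids.add(jobid)
    return ids


def jobs_from_joblist(message):
    """Job IDs after 'joblist:' in the message (assumed 'jobid:slot' entries)."""
    _, _, tail = message.partition("joblist:")
    ids = set()
    for entry in tail.split(","):
        # Truncated entries (such as a trailing "1'") do not match and are skipped.
        match = re.match(r"\s*(\d+\.[\w.\-]+):\d+", entry)
        if match:
            ids.add(match.group(1))
    return ids


def stale_jobs(message, jobs_field):
    """Job IDs named in the message that pbsnodes does not list on the node."""
    return jobs_from_joblist(message) - jobs_from_pbsnodes(jobs_field)
```

Fed with the checkjob message and the "jobs =" value for g042 shown
above, stale_jobs() would report 1890.audhumbla.fysik.dtu.dk, i.e. the
job that qstat no longer knows about.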
--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark