[torqueusers] Getting Torque 2.x/Maui 3.x to work on CentOS 6?

Paul Raines raines at nmr.mgh.harvard.edu
Tue Nov 20 08:32:12 MST 2012


I never figured out the problem I had in the post you linked, but it
disappeared after I shut down all the servers, cleared all jobs out of the
queues, and restarted everything.
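
For the record, the reset was nothing clever, roughly this sequence run as
root (the exact stop/start details will differ depending on how maui and
pbs_server were launched on your system):

  qdel $(qselect)        # clear every job out of the queues
  qterm -t quick         # shut down pbs_server
  # stop maui and the pbs_mom daemons on the nodes, then bring
  # everything back up in order: pbs_server, the moms, then maui
  pbs_server
  maui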

My best guess is that there is a bug triggered by adding new nodes while
there are jobs running and queued.  So only add new nodes when there are
no jobs.
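
In practice that means verifying the queues are really empty before touching
the nodes file, something like this (the node name is illustrative;
/var/spool/torque is the default TORQUE home, yours may differ):

  qselect | wc -l        # should print 0 before you proceed
  echo "g099 np=16" >> /var/spool/torque/server_priv/nodes
  qterm -t quick && pbs_server    # restart so the nodes file is reread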

What I still have are two big problems that I am just living with for now.
The worst is when a node "goes bad", e.g. its root disk fails.  When that
happens, jobs still get scheduled on the node but fail in a way where
they just get stuck, with checkjob giving a reason of "Execution server 
rejected request MSG=cannot send job to mom, state=PRERUN". The bad part
is that it keeps queuing all new jobs on the node, where they get stuck the
same way.  Yesterday when this happened, I had over 180 jobs trying to run
on the bad node and getting stuck.  The only solution is to take the node
offline, qdel all the stuck jobs, and email all my users apologizing and
asking them to resubmit their jobs.
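
For what it's worth, the cleanup when this happens is roughly the following
(node name is from my case; the job IDs are illustrative):

  pbsnodes -o g042 -N 'root disk failed'   # mark the node offline with a note
  qstat -rn | grep -B2 g042                # find the jobs assigned to it
  qdel 4501 4502                           # plain qdel for the stuck jobs
  qdel -p 4503                             # -p force-purges one that will not die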

The second is simply that maui crashes all the time, from what appear
to be memory-related segfaults.  I run it in a screen session under valgrind
in an infinite loop so it restarts as soon as it crashes.  Sometimes it
crashes in a way that leaves it hung, so I also have a cron job constantly
running 'showq' and emailing me when that fails, so I know to go 'kill -9'
the valgrind process to get maui restarted.
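
Concretely, the wrapper and the cron check look roughly like this (paths are
from my install; this assumes maui stays in the foreground when run under
valgrind, which it does here):

  # inside a screen session: restart maui under valgrind whenever it dies
  while true; do
      valgrind --log-file=/var/log/maui-vg.%p.log /usr/local/maui/sbin/maui
      sleep 5                              # do not spin if it crashes instantly
  done

  # crontab entry: mail root if showq hangs or fails (timeout is from coreutils)
  */5 * * * * timeout 60 /usr/local/maui/bin/showq >/dev/null 2>&1 || echo "showq failed" | mail -s "maui down" root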

I am using maui-3.3.1 and torque-2.5.11.

-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Tue, 20 Nov 2012 7:45am, Ole Holm Nielsen wrote:

> We're upgrading our cluster to CentOS 6.3 and would like to run recent 
> versions of Torque 2.x and Maui 3.x.  Unfortunately, it seems that with 
> Torque version 2.5.12 and Maui (version 3.3.1 as well as 3.2.6p21 tested), 
> Maui often wants to run jobs on nodes that are already full with running 
> jobs.  It seems that we have the same problem as described in this July 2012 
> posting (with no solution posted):
> http://www.supercluster.org/pipermail/torqueusers/2012-July/014848.html
>
> QUESTION:
> ---------
> Can anyone recommend versions of Torque 2.x and Maui 3.x which have been 
> demonstrated to work correctly on CentOS 6.x?
>
> Our older cluster installation runs Torque 2.3.7 and Maui 3.2.6p21 on CentOS 
> 5.3 and works like a charm.  Perhaps we should go back to Torque 2.3 on 
> CentOS 6 as well?
>
> Further details:
> ----------------
>
> For example, we're seeing this strange behavior where Maui wants to schedule 
> job 1891 to run on node g042 (which is already full, running job 1894). The 
> checkjob command claims that node g042 has a joblist containing job 1890, 
> which has already completed.  There are many other free nodes available for 
> scheduling jobs.
>
> # checkjob 1891
> ...
> Holds:    Defer
> Messages:  cannot start job - RM failure, rc: 15046, msg: 'Resource 
> temporarily unavailable REJHOST=g042 MSG=cannot allocate node 'g042' to job - 
> node not currently available (nps needed/free: 16/0, gpus needed/free: 0/0, 
> joblist: 
> 1890.audhumbla.fysik.dtu.dk:0,1890.audhumbla.fysik.dtu.dk:1,1890.audhumbla.fysik.dtu.dk:2,1890.audhumbla.fysik.dtu.dk:3,1890.audhumbla.fysik.dtu.dk:4,1890.audhumbla.fysik.dtu.dk:5,1890.audhumbla.fysik.dtu.dk:6,1'
>
> # qstat -f1 1890
> qstat: Unknown Job Id 1890.audhumbla.fysik.dtu.dk
>
> # pbsnodes -a g042
> g042
>     state = job-exclusive
>     np = 16
>     properties = xeon2670,hp5412g,infiniband,xeon-e5
>     ntype = cluster
>     jobs = 0/1894.audhumbla.fysik.dtu.dk, 1/1894.audhumbla.fysik.dtu.dk, 
> 2/1894.audhumbla.fysik.dtu.dk, 3/1894.audhumbla.fysik.dtu.dk, 
> 4/1894.audhumbla.fysik.dtu.dk, 5/1894.audhumbla.fysik.dtu.dk, 
> 6/1894.audhumbla.fysik.dtu.dk, 7/1894.audhumbla.fysik.dtu.dk, 
> 8/1894.audhumbla.fysik.dtu.dk, 9/1894.audhumbla.fysik.dtu.dk, 
> 10/1894.audhumbla.fysik.dtu.dk, 11/1894.audhumbla.fysik.dtu.dk, 
> 12/1894.audhumbla.fysik.dtu.dk, 13/1894.audhumbla.fysik.dtu.dk, 
> 14/1894.audhumbla.fysik.dtu.dk, 15/1894.audhumbla.fysik.dtu.dk
>     status = 
> rectime=1353414789,varattr=,jobs=1894.audhumbla.fysik.dtu.dk,state=free,size=32653284kb:32834420kb,netload=667272390,gres=,loadave=16.47,ncpus=16,physmem=65932324kb,availmem=163550852kb,totmem=168332316kb,idletime=1632363,nusers=1,nsessions=1,sessions=36896,uname=Linux 
> g042.dcsc.fysik.dtu.dk 2.6.32-279.11.1.el6.x86_64 #1 SMP Tue Oct 16 15:57:10 
> UTC 2012 x86_64,opsys=linux,arch=x86_64
>     gpus = 0
>
>
> Apparently, the job/node data structures between Torque and Maui are out of 
> sync, to the extent that the batch system is almost useless.
>
> -- 
> Ole Holm Nielsen
> Department of Physics, Technical University of Denmark
>
>
>

