[torqueusers] Getting Torque 2.x/Maui 3.x to work on CentOS 6?

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Tue Nov 20 09:19:30 MST 2012


Hi Paul,

Thanks very much for your detailed observations!  I've also heard from a 
colleague who uses Torque 2.3.6 and Maui 3.2.6p21 successfully. Some of 
his compute nodes run CentOS 6.3.  So I think we may need to downgrade 
to the older versions of Torque and Maui which seem to work reliably. Or 
perhaps start using SLURM instead...

/Ole

On 11/20/2012 04:32 PM, Paul Raines wrote:
>
> I never figured out that problem I had in the post you linked, but it
> disappeared after I shut down all servers, cleared all queues of all
> jobs, and restarted everything.
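>
> In case it helps anyone, the reset amounted to roughly the following
> (service names and the qstat parsing are only a sketch, adjust for your
> own install):
>
>    qdel $(qstat | awk 'NR>2 {print $1}')   # clear every queued/running job
>    qterm -t quick                          # stop pbs_server
>    service maui stop
>    # restart pbs_mom on every compute node, then bring things back up:
>    service pbs_server start
>    service maui start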
>
> My best guess is there is a bug that causes it when you add new nodes while
> there are jobs running and queued.  So only add new nodes when there are
> no jobs.
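>
> For the record, adding a node on a quiet system is just something like
> the following (g050 and np=16 are made-up examples):
>
>    qmgr -c "set server scheduling = false"   # stop dispatching new jobs
>    # ... wait for running jobs to finish ...
>    qmgr -c "create node g050 np=16"          # register the new node
>    qmgr -c "set server scheduling = true"    # resume scheduling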
>
> What I still have are two big problems that I am just living with for now.
> The worst is when a node "goes bad", such as when its root disk fails.  When
> that happens, jobs still get scheduled on the node but fail in a manner where
> they just get stuck, with checkjob giving a reason of "Execution server
> rejected request MSG=cannot send job to mom, state=PRERUN".  The bad thing
> is that it keeps queuing all new jobs on the node, and they get stuck the
> same way.  Yesterday when this happened, I had over 180 jobs trying to run
> on the bad node and getting stuck.  The only solution is to take the node
> offline, qdel all the jobs, and email all my users apologizing and telling
> them to resubmit their jobs.
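>
> For reference, the manual cleanup each time is roughly this (the node
> name and the grep pattern are illustrative):
>
>    pbsnodes -o g042                          # mark the bad node offline
>    # qdel every job that is wedged on that node
>    for job in $(qstat -r -n -1 | grep g042 | awk '{print $1}' | cut -d. -f1); do
>        qdel "$job"
>    done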
>
> The second is simply that maui crashes all the time due to what appear
> to be memory segfaults.  I run it under valgrind, in an infinite loop
> inside a screen session, so it restarts as soon as it crashes.  Sometimes
> it crashes in a way that leaves it hung, so I have a cron job constantly
> running 'showq' and emailing me when that fails, at which point I 'kill -9'
> the valgrind process to get maui restarted.
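>
> Concretely, the two pieces look roughly like this (the maui path, the
> mail address and the 60-second timeout are just examples):
>
>    # inside screen: restart maui under valgrind whenever it dies
>    while true; do
>        valgrind /usr/local/maui/sbin/maui
>        sleep 5
>    done
>
>    # cron job: mail me if showq hangs or fails, so I know to go kill -9
>    # the valgrind process by hand
>    timeout 60 showq > /dev/null 2>&1 || \
>        echo "maui not responding on $(hostname)" | mail -s "maui down" root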
>
> I am using maui-3.3.1 and torque-2.5.11.
>
> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>
>
>
> On Tue, 20 Nov 2012 7:45am, Ole Holm Nielsen wrote:
>
>> We're upgrading our cluster to CentOS 6.3 and would like to run recent
>> versions of Torque 2.x and Maui 3.x.  Unfortunately, it seems that
>> with Torque version 2.5.12 and Maui (version 3.3.1 as well as 3.2.6p21
>> tested), Maui often wants to run jobs on nodes that are already full
>> with running jobs.  It seems that we have the same problem as
>> described in this July 2012 posting (with no solution posted):
>> http://www.supercluster.org/pipermail/torqueusers/2012-July/014848.html
>>
>> QUESTION:
>> ---------
>> Can anyone recommend versions of Torque 2.x and Maui 3.x which have
>> been demonstrated to work correctly on CentOS 6.x?
>>
>> Our older cluster installation runs Torque 2.3.7 and Maui 3.2.6p21 on
>> CentOS 5.3 and works like a charm.  Perhaps we should go back to
>> Torque 2.3 on CentOS 6 as well?
>>
>> Further details:
>> ----------------
>>
>> For example, we're seeing this strange behavior where Maui wants to
>> schedule job 1891 to run on node g042 (which is already full, running
>> job 1894).  The checkjob command claims that node g042 has a joblist
>> containing job 1890, which has already completed.  There are many
>> other free nodes available for scheduling jobs.
>>
>> # checkjob 1891
>> ...
>> Holds:    Defer
>> Messages:  cannot start job - RM failure, rc: 15046, msg: 'Resource
>> temporarily unavailable REJHOST=g042 MSG=cannot allocate node 'g042'
>> to job - node not currently available (nps needed/free: 16/0, gpus
>> needed/free: 0/0, joblist:
>> 1890.audhumbla.fysik.dtu.dk:0,1890.audhumbla.fysik.dtu.dk:1,1890.audhumbla.fysik.dtu.dk:2,1890.audhumbla.fysik.dtu.dk:3,1890.audhumbla.fysik.dtu.dk:4,1890.audhumbla.fysik.dtu.dk:5,1890.audhumbla.fysik.dtu.dk:6,1'
>>
>>
>> # qstat -f1 1890
>> qstat: Unknown Job Id 1890.audhumbla.fysik.dtu.dk
>>
>> # pbsnodes -a g042
>> g042
>>     state = job-exclusive
>>     np = 16
>>     properties = xeon2670,hp5412g,infiniband,xeon-e5
>>     ntype = cluster
>>     jobs = 0/1894.audhumbla.fysik.dtu.dk,
>> 1/1894.audhumbla.fysik.dtu.dk, 2/1894.audhumbla.fysik.dtu.dk,
>> 3/1894.audhumbla.fysik.dtu.dk, 4/1894.audhumbla.fysik.dtu.dk,
>> 5/1894.audhumbla.fysik.dtu.dk, 6/1894.audhumbla.fysik.dtu.dk,
>> 7/1894.audhumbla.fysik.dtu.dk, 8/1894.audhumbla.fysik.dtu.dk,
>> 9/1894.audhumbla.fysik.dtu.dk, 10/1894.audhumbla.fysik.dtu.dk,
>> 11/1894.audhumbla.fysik.dtu.dk, 12/1894.audhumbla.fysik.dtu.dk,
>> 13/1894.audhumbla.fysik.dtu.dk, 14/1894.audhumbla.fysik.dtu.dk,
>> 15/1894.audhumbla.fysik.dtu.dk
>>     status =
>> rectime=1353414789,varattr=,jobs=1894.audhumbla.fysik.dtu.dk,state=free,size=32653284kb:32834420kb,netload=667272390,gres=,loadave=16.47,ncpus=16,physmem=65932324kb,availmem=163550852kb,totmem=168332316kb,idletime=1632363,nusers=1,nsessions=1,sessions=36896,uname=Linux
>> g042.dcsc.fysik.dtu.dk 2.6.32-279.11.1.el6.x86_64 #1 SMP Tue Oct 16
>> 15:57:10 UTC 2012 x86_64,opsys=linux,arch=x86_64
>>     gpus = 0
>>
>>
>> Apparently, the job/node data structures in Torque and Maui are out of
>> sync, to the extent that the batch system is almost useless.
>>
>> --
>> Ole Holm Nielsen
>> Department of Physics, Technical University of Denmark
>>
>>
>>
>

-- 
Ole Holm Nielsen
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark
E-mail: Ole.H.Nielsen at fysik.dtu.dk
Homepage: http://www.fysik.dtu.dk/~ohnielse/
Tel: (+45) 4525 3187 / Mobile (+45) 5180 1620 / Fax: (+45) 4593 2399

