[torqueusers] Getting Torque 2.x/Maui 3.x to work on CentOS 6?
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Tue Nov 20 09:19:30 MST 2012
Thanks very much for your detailed observations! I've also heard from a
colleague who uses Torque 2.3.6 and Maui 3.2.6p21 successfully. Some of
his compute nodes run CentOS 6.3. So I think we may need to downgrade
to the older versions of Torque and Maui which seem to work reliably. Or
perhaps start using SLURM instead...
On 11/20/2012 04:32 PM, Paul Raines wrote:
> I never figured out that problem I had in the post you linked, but it
> disappeared after I shut down all servers, cleared all queues of all
> jobs, and restarted everything.
> My best guess is there is a bug that causes it when you add new nodes while
> there are jobs running and queued. So only add new nodes when there are
> no jobs.
> What I still have are two big problems that I am just living with for now.
> The worst is when a node "goes bad" such as its root disk fails. When that
> happens, jobs still get scheduled on the node but fail in a manner where
> they just get stuck with checkjob giving a reason of "Execution server
> rejected request MSG=cannot send job to mom, state=PRERUN". The bad thing
> is it keeps queuing all new jobs on the node which get stuck the same way.
> Yesterday when this happened, I had over 180 jobs trying to run on the
> bad node and getting stuck. The only solution is to take the node offline
> and then qdel all the jobs and email all my users apologizing and telling
> them to resubmit all their jobs.
> The second is simply that maui crashes all the time due to what appears
> to be memory segfaults. I run it in a screen session inside valgrind
> in an infinite loop so it restarts as soon as it crashes. Sometimes it
> crashes in a way that it hangs. So I have to have a cron job constantly
> doing 'showq' and emailing me when it fails so I know I have to go kill
> the valgrind to get maui to restart.
> I am using maui-3.3.1 and torque-2.5.11
> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
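The showq check described above could look roughly like the sketch below.
It is only an illustration, not the script actually used: the function
names and the 30 second timeout are made up here, and it assumes that showq
exits with a non-zero status when it cannot reach the scheduler. A hang is
caught by the timeout.

```python
import subprocess
import sys


def scheduler_responds(cmd=("showq",), timeout=30.0):
    """Return True if cmd exits with status 0 within timeout seconds.

    A hang (timeout), a non-zero exit status, and a command that cannot
    be started all count as failure.
    """
    try:
        result = subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # the command hung, e.g. the scheduler is wedged
    except OSError:
        return False  # the command could not be started at all
    return result.returncode == 0


def main():
    # Hypothetical entry point: write to stderr and return non-zero so
    # that cron (which mails any output) or a wrapper can raise the alarm.
    if not scheduler_responds():
        print("showq did not answer; the scheduler may need a restart",
              file=sys.stderr)
        return 1
    return 0
```

Calling sys.exit(main()) from a cron entry would produce mail only when the
check fails, since cron mails whatever a job writes to its output.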
> On Tue, 20 Nov 2012 7:45am, Ole Holm Nielsen wrote:
>> We're upgrading our cluster to CentOS 6.3 and would like to run recent
>> versions of Torque 2.x and Maui 3.x. Unfortunately, it seems that
>> with Torque version 2.5.12 and Maui (version 3.3.1 as well as 3.2.6p21
>> tested), Maui often wants to run jobs on nodes that are already full
>> with running jobs. It seems that we have the same problem as
>> described in this July 2012 posting (with no solution posted):
>> Can anyone recommend versions of Torque 2.x and Maui 3.x which have
>> been demonstrated to work correctly on CentOS 6.x?
>> Our older cluster installation runs Torque 2.3.7 and Maui 3.2.6p21 on
>> CentOS 5.3 and works like a charm. Perhaps we should go back to
>> Torque 2.3 on CentOS 6 as well?
>> Further details:
>> For example, we're seeing this strange behavior where Maui wants to
>> schedule job 1891 to run on node g042 (which is already full running
>> job 1894). The checkjob command claims that node g042 has a joblist
>> containing job 1890, which was already completed. There are many
>> other free nodes available for scheduling jobs.
>> # checkjob 1891
>> Holds: Defer
>> Messages: cannot start job - RM failure, rc: 15046, msg: 'Resource
>> temporarily unavailable REJHOST=g042 MSG=cannot allocate node 'g042'
>> to job - node not currently available (nps needed/free: 16/0, gpus
>> needed/free: 0/0, joblist:
>> # qstat -f1 1890
>> qstat: Unknown Job Id 1890.audhumbla.fysik.dtu.dk
>> # pbsnodes -a g042
>> state = job-exclusive
>> np = 16
>> properties = xeon2670,hp5412g,infiniband,xeon-e5
>> ntype = cluster
>> jobs = 0/1894.audhumbla.fysik.dtu.dk,
>> 1/1894.audhumbla.fysik.dtu.dk, 2/1894.audhumbla.fysik.dtu.dk,
>> 3/1894.audhumbla.fysik.dtu.dk, 4/1894.audhumbla.fysik.dtu.dk,
>> 5/1894.audhumbla.fysik.dtu.dk, 6/1894.audhumbla.fysik.dtu.dk,
>> 7/1894.audhumbla.fysik.dtu.dk, 8/1894.audhumbla.fysik.dtu.dk,
>> 9/1894.audhumbla.fysik.dtu.dk, 10/1894.audhumbla.fysik.dtu.dk,
>> 11/1894.audhumbla.fysik.dtu.dk, 12/1894.audhumbla.fysik.dtu.dk,
>> 13/1894.audhumbla.fysik.dtu.dk, 14/1894.audhumbla.fysik.dtu.dk,
>> status =
>> g042.dcsc.fysik.dtu.dk 2.6.32-279.11.1.el6.x86_64 #1 SMP Tue Oct 16
>> 15:57:10 UTC 2012 x86_64,opsys=linux,arch=x86_64
>> gpus = 0
>> Apparently, the job/node data structures between Torque and Maui seem
>> to be out of sync to the extent that the batch system is almost useless.
>> Ole Holm Nielsen
>> Department of Physics, Technical University of Denmark
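To spot this kind of mismatch automatically, one could compare the job IDs
that a node lists against the job IDs the server still knows about. The
sketch below is only an illustration under stated assumptions: the function
names are made up, the "slot/jobid" format is taken from the pbsnodes output
above, and since the checkjob joblist is cut off in the message, the caller
has to supply the two sets of IDs.

```python
def jobs_in_field(jobs_field):
    """Extract the distinct job IDs from a 'slot/jobid, slot/jobid' string."""
    job_ids = set()
    for entry in jobs_field.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing comma
        # Each entry looks like "0/1894.audhumbla.fysik.dtu.dk".
        _slot, sep, job_id = entry.partition("/")
        job_ids.add(job_id if sep else entry)
    return job_ids


def stale_jobs(listed, known):
    """Return the job IDs that are listed for a node but not in known."""
    return sorted(set(listed) - set(known))


field = "0/1894.audhumbla.fysik.dtu.dk, 1/1894.audhumbla.fysik.dtu.dk,"
on_node = jobs_in_field(field)

# Job 1890 is the one that qstat reports as unknown in the message above.
scheduler_view = on_node | {"1890.audhumbla.fysik.dtu.dk"}
print(stale_jobs(scheduler_view, on_node))
```

Here the scheduler's view holds one ID that the node no longer lists, so
that ID is reported as stale.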
Ole Holm Nielsen
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark
E-mail: Ole.H.Nielsen at fysik.dtu.dk
Tel: (+45) 4525 3187 / Mobile (+45) 5180 1620 / Fax: (+45) 4593 2399