[torqueusers] torque 4.2.X problems on RHEL 6.4

Ken Nielson knielson at adaptivecomputing.com
Fri Sep 6 13:19:28 MDT 2013


On Fri, Sep 6, 2013 at 12:44 PM, Liam Forbes <lforbes at arsc.edu> wrote:

> On Jul 31, 2013, at 10:29 AM, Rick McKay <rmckay at adaptivecomputing.com>
> wrote:
> > Liam,
> >
> > This is a bug fixed in 4.2.4. You'll find a brief description in the git
> log.
> >
> > Rick
>
> Rick,
>
> I'm not sure how to access the torque git log (we don't download and
> build torque from source).  Can you send instructions, or the specific
> text you believe applies?
>
> We've installed torque 4.2.4 via Penguin Computing, but I'm afraid the
> problem continues, although with slightly different error messages.
>
> Sep  6 10:19:17 p7 Sep  6 10:19:17 pbs_mom: LOG_ERROR::pelog_err,
> prolog/epilog failed, file: /var/spool/torque/mom_priv/epilogue.parallel,
> exit: 255, nonzero p/e exit status
> Sep  6 10:19:17 p7 Sep  6 10:19:17 pbs_mom: LOG_ERROR::run_epilogues,
> parallel epilog failed
> Sep  6 10:19:17 p6 Sep  6 10:19:17 pbs_mom: LOG_ERROR::pelog_err,
> prolog/epilog failed, file: /var/spool/torque/mom_priv/epilogue.parallel,
> exit: 255, nonzero p/e exit status
> Sep  6 10:19:17 p6 Sep  6 10:19:17 pbs_mom: LOG_ERROR::run_epilogues,
> parallel epilog failed
> Sep  6 10:19:17 p5 Sep  6 10:19:17 pbs_mom: LOG_ERROR::pelog_err,
> prolog/epilog failed, file: /var/spool/torque/mom_priv/epilogue.parallel,
> exit: 255, nonzero p/e exit status
> Sep  6 10:19:17 p5 Sep  6 10:19:17 pbs_mom: LOG_ERROR::run_epilogues,
> parallel epilog failed
>
> Can you suggest anything else I should look at to track down the cause of
> this problem?
>
> packet:~$ rpm -qi torque
> Name        : torque                       Relocations: (not relocatable)
> Version     : 4.2.4                        Vendor: Penguin Computing, Inc.
> Release     : 645g0000                     Build Date: Sun 25 Aug 2013 01:20:14 PM AKDT
> Install Date: Fri 30 Aug 2013 04:08:39 PM AKDT      Build Host: localhost.localdomain
> Group       : System Environment/Daemons   Source RPM: torque-4.2.4-645g0000.src.rpm
> Size        : 7957523                      License: Freely redistributable
> Signature   : DSA/SHA1, Sun 25 Aug 2013 01:35:26 PM AKDT, Key ID 07224b0a0a1e1108
> Packager    : Penguin Computing, Inc. <http://www.penguincomputing.com>
> URL         : http://www.clusterresources.com/products/torque/
> Summary     : Torque Resource Manager (Tera-scale Open-source Resource and QUEue manager)
> Description :
> TORQUE (Tera-scale Open-source Resource and QUEue manager) is a resource
> manager providing control over batch jobs and distributed compute nodes
>
> > On Thu, May 30, 2013 at 11:33 AM, Liam Forbes <lforbes at arsc.edu> wrote:
> > (Originally sent May 30, 2013.)
> >
> >> As part of an upgrade from RHEL 5 to RHEL 6.4, we updated from torque
> 4.2.0 to 4.2.2 on our beowulf cluster (running ClusterWare from Penguin
> Computing).  However, when executing multi-node test jobs, we found the
> epilogue.parallel script is no longer being executed on the sister nodes.
> Additionally, the 5-minute timeout waiting for the
> epilogue/epilogue.parallel to complete was being hit, and the sister
> nodes were marked down by MOAB, but not by torque.  The only way we know
> of to recover the sister nodes is to reboot them.  I'm pretty sure this
> wasn't a problem for the three months we were running torque 4.2.0 on
> RHEL 5.
> >>
> >> Looking at an strace of a MOM process, and the spawned child
> processes, on a sister node, I cannot find any exec*() of the
> epilogue.parallel script, but I can find the execve() for
> prologue.parallel.  Both scripts have the same contents, are located in
> the same directory, and have the same file permissions.
> >>
> >> $ sudo ls -al /var/spool/torque/mom_priv/
> >> total 32
> >> drwxr-x--x.  3 root root     4096 May 28 11:18 .
> >> drwxr-xr-x. 15 root root     4096 May 27 17:34 ..
> >> -rw-r--r--   1 root root      336 Aug  9  2012 config
> >> -rwxr-xr-x   1 root linuxman 2836 May 28 09:52 epilogue
> >> -rwxr-xr-x   1 root linuxman 2836 May 28 09:52 epilogue.parallel
> >> drwxr-x--x   2 root root     4096 Nov 21  2012 jobs
> >> -rwxr-xr-x   1 root linuxman 2836 May 28 09:52 prologue
> >> -rwxr-xr-x   1 root linuxman 2836 May 28 09:52 prologue.parallel
> >>
> >> I tried replacing the epilogue.parallel script with one that logs to
> syslog and then exits, but even that wasn't executed.  I'm pretty sure it's
> not the contents of the script.
> >>
> >> Eventually, we found a workaround.  We downgraded torque to 4.1.3 on
> our production cluster.  Actually, we downgraded to 4.2.1, then 4.2.0, and
> then 4.1.3.  None of the 4.2.X versions executed the epilogue.parallel
> script.  4.1.3 does, although it still seems to wait the full 5 minutes for
> the job to clear.  Fortunately, the nodes aren't marked down in MOAB when
> the timeout finally occurs.  Our test cluster still has 4.2.2 installed for
> further testing and diagnostics.
> >>
> >> Attached are the log entries, MOM and syslog, from one occurrence of
> the problem.  Any assistance would be appreciated.  I'm at a bit of a loss
> as to how to proceed tracking down this problem.
>
> Regards,
> -liam
>
> -There are uncountably more irrational fears than rational ones. -P. Dolan
> Liam Forbes             Senior HPC Systems Analyst,           LPIC1, CISSP
> ARSC, U of AK, Fairbanks   lforbes at arsc.edu 907-450-8618 fax: 907-450-8605
>
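Liam, the "exit: 255" in those pelog_err lines is just the nonzero exit
status pbs_mom got back from the script, so by itself it doesn't say much
about the cause.  One quick check is to run the epilogue by hand as root
with a placeholder job id and see whether it fails the same way outside
the MOM ("0.test" below is only a stand-in, not a real job id):

    # run the parallel epilogue directly and report its exit status
    /var/spool/torque/mom_priv/epilogue.parallel 0.test; echo "exit: $?"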

I am adding Liz Chan to this thread. She is the engineer who made the fix
for this problem.

Her fix is in 4.2.5, which will be coming out shortly.
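In the meantime, to confirm whether pbs_mom ever launches the parallel
epilogue on a sister node, tracing only exec calls keeps the strace
output manageable (this sketch assumes a single pbs_mom process per
node):

    # follow forks from the running MOM and log every execve() it makes
    strace -f -tt -e trace=execve -p $(pidof pbs_mom) -o /tmp/mom_execve.out

Any launch of epilogue.parallel should then show up as an execve() line
in /tmp/mom_execve.out.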
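It may also be worth repeating your stub-script test on 4.2.4, since the
failure mode appears to have changed from the script never running to a
nonzero exit.  A minimal sketch of such a stub (argument positions follow
the usual TORQUE prologue/epilogue convention, where $1 is the job id):

    #!/bin/sh
    # Stub parallel epilogue: record the invocation in syslog, then exit 0
    # so pbs_mom cannot attribute a failure to the script contents.
    logger -t epilogue.parallel "ran on $(hostname) for job $1"
    exit 0

As with your real scripts, the stub needs to be owned by root and
executable, with no write access for other users, or the MOM will refuse
to run it.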

Ken

-- 
Ken Nielson
+1 801.717.3700 office +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
www.adaptivecomputing.com