[torqueusers] Fluent Infiniband jobs fail, only in PBS

Edsall, William (WJ) WJEdsall at dow.com
Fri Feb 29 09:17:18 MST 2008


Jim,
 We placed the ulimit command into our pbs_mom startup script to execute
before mom starts.
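
For reference, a minimal sketch of what that change might look like (the
init script path and the pbs_mom binary location are assumptions and will
vary by install):

    # in the pbs_mom init/startup script, before mom is launched:
    ulimit -l unlimited       # raise the locked-memory (memlock) limit
    /usr/local/sbin/pbs_mom   # existing startup line; path is an example

Since child processes inherit their parent's limits, every job that mom
spawns then starts with the higher locked-memory limit.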

To see if you're running into the same issue, try running an interactive
qsub job on one node and typing 'limit' on the node you're sent to, or
adding a 'limit' command to your pbs script.
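
Something along these lines ('limit' is the csh/tcsh builtin; 'ulimit -a'
is the sh/bash equivalent; the resource request is just a placeholder):

    qsub -I -l nodes=1
    # then, on the compute node the job lands on:
    limit          # csh/tcsh
    ulimit -a      # sh/bash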

I was seeing the following:
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    8192 kbytes
coredumpsize 0 kbytes
memoryuse    unlimited
vmemoryuse   19096560 kbytes
descriptors  32768
memorylocked 32 kbytes -  bad!
maxproc      73728

After setting the locked limit to unlimited you should see something
like this:
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    8192 kbytes
coredumpsize 0 kbytes
memoryuse    unlimited
vmemoryuse   19096560 kbytes
descriptors  1024
memorylocked 7500000 kbytes
maxproc      73728
 

Note that the bad memorylocked limit only shows up if you check limits
from inside a PBS job; a regular login shell on the node will look fine,
because jobs inherit their limits from pbs_mom rather than from your
login environment.
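
The HP-MPI error in the message below points at /etc/security/limits.conf;
raising the memlock limit there is another way to get the same effect,
though those limits are applied through PAM (login/ssh sessions), so they
may not cover pbs_mom itself if mom is started at boot. A rough sketch of
such entries (values are examples only):

    # /etc/security/limits.conf
    *    soft    memlock    unlimited
    *    hard    memlock    unlimited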

-----Original Message-----
From: James J Coyle [mailto:jjc at iastate.edu] 
Sent: Friday, February 29, 2008 11:10 AM
To: Edsall, William (WJ)
Cc: torqueusers at supercluster.org
Subject: Re: [torqueusers] Fluent Infiniband jobs fail, only in PBS 

William,

   We run Fluent here through PBS also.

   I'm not sure I understood your fix.

   The line

  ulimit -l unlimited

appears to be a (Bourne) shell command, but pbs_mom is an executable
built from a C routine.

  Do you mean that you placed this command
in the /var/spool/torque/mom_priv/prologue file?

  Somewhere else?

Thanks,
 - Jim Coyle


> Hello again,
>  I've answered my own question but wanted to follow up with everyone
> and share the result.
> 
> Basically pbs_mom was being memorylocked to 32k, so this explained the
> strange behavior of jobs run through PBS.
> 
> The resolution was adding the following line to pbs_mom:
> ulimit -l unlimited
> 
> And of course, restarting pbs_mom.
> 
> 
> ________________________________
> 
> 	From: torqueusers-bounces at supercluster.org
> [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Edsall,
> William (WJ)
> 	Sent: Thursday, February 28, 2008 3:01 PM
> 	To: torqueusers at supercluster.org
> 	Subject: [torqueusers] Fluent Infiniband jobs fail, only in PBS
> 
> 	Hello!
> 	  I'm experiencing a strange issue with PBS and the application
> Fluent. We experienced a power outage, and have since had trouble
> running Fluent through PBS over Infiniband.
> 
> 	- Fluent runs fine through PBS on Ethernet
> 	- Fluent runs fine outside of PBS on Infiniband
> 	- Fluent only fails when run through PBS, over Infiniband
> 
> 	Any suggestions? Even if I run PBS with the -I switch, I can't
> run fluent successfully over infiniband. Something environmentally
> changed by PBS is causing MPI to fail. My PBS version is 2.1.7.
> 
> 	Here is the failure result of a job on 2 nodes running through
> PBS:
> 	Host spawning Node 0 on machine "node30" (unix).
> 	/apps/fluent/Fluent.Inc/fluent6.3.35/bin/fluent -r6.3.35 3ddp
> -node -t16 -pib -mpi=hp -cnf=/home/u396929/fluent_test/nodes2 -mport
> 192.168.0.30:192.168.0.30:46683:0
> 
> 	Starting
> /apps/fluent/Fluent.Inc/fluent6.3.35/multiport/mpi/lnamd64/hp/bin/mpirun
> -prot -IBV -e MPI_HASIC_IBV=1 -f /tmp/fluent-appfile.8049
> 
> 	fluent_mpi.6.3.35: Rank 0:4: MPI_Init: ibv_create_qp() failed
> 	fluent_mpi.6.3.35: Rank 0:4: MPI_Init: probably you need to
> increase pinnable memory in /etc/security/limits.conf
> 	fluent_mpi.6.3.35: Rank 0:4: MPI_Init: Can't initialize RDMA
> device
> 	fluent_mpi.6.3.35: Rank 0:4: MPI_Init: MPI BUG: Cannot
> initialize RDMA protocol
> 	MPI Application rank 4 exited before MPI_Init() with status 1
> 
> 
> 	Here is the success result from the command line (same nodes):
> 	Host spawning Node 0 on machine "node30" (unix).
> 	/apps/fluent/Fluent.Inc/fluent6.3.35/bin/fluent -r6.3.35 3ddp
> -node -t16 -pib -mpi=hp -cnf=/home/u396929/fluent_test/nodes2 -mport
> 192.168.0.30:192.168.0.30:43334:0
> 
> 	Starting
> /apps/fluent/Fluent.Inc/fluent6.3.35/multiport/mpi/lnamd64/hp/bin/mpirun
> -prot -IBV -e MPI_HASIC_IBV=1 -f /tmp/fluent-appfile.8871
> 
> 	HP-MPI licensed for Fluent.
> 	Host 0 -- ip 192.168.0.30 -- ranks 0 - 7
> 	Host 1 -- ip 192.168.0.31 -- ranks 8 - 15
> 
> 	Please help! Thanks much!
> 
> 	William
> 
> 




More information about the torqueusers mailing list