[torqueusers] Fluent Infiniband jobs fail, only in PBS

James J Coyle jjc at iastate.edu
Fri Feb 29 09:10:01 MST 2008


William,

   We run Fluent here through PBS also.

   I'm not sure I understood your fix.

   The line

  ulimit -l unlimited

appears to be a (Bourne) shell command, but pbs_mom is a compiled
executable built from C source, not a script.

  Do you mean that you placed this command
in the /var/spool/torque/mom_priv/prologue file?

  Somewhere else?
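
  In case it helps anyone chasing the same symptom, here is a minimal
sketch for comparing the locked-memory limit inside and outside PBS.
It assumes bash is available on the compute nodes, and the function
name check_memlock is only an illustrative label, not part of TORQUE.
It does not say where the ulimit line belongs; it only reports the
limit that a given shell has inherited.

```shell
#!/bin/bash
# Print the soft locked-memory limit this shell inherited.
# bash reports the value in kilobytes, or the word "unlimited".
check_memlock() {
    local limit
    limit=$(ulimit -l)
    if [ "$limit" = "unlimited" ]; then
        echo "memlock: unlimited"
    else
        echo "memlock: ${limit} KB"
    fi
}

check_memlock
```

  Run it once from a login shell on a compute node and once inside an
interactive job (qsub -I). If the job reports a much smaller value,
that would suggest pbs_mom was started with the lower limit and passed
it on to the job.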

Thanks,
 - Jim Coyle


> Hello again,
>  I've answered my own question but wanted to follow up with everyone and
> share the result.
> 
> Basically pbs_mom was running with its locked-memory limit (ulimit -l)
> set to 32k, which explains the strange behavior of jobs run through PBS.
> 
> The resolution was adding the following line to pbs_mom:
> ulimit -l unlimited
> 
> And of course, restarting pbs_mom.
> 
> 
> ________________________________
> 
> 	From: torqueusers-bounces at supercluster.org
> [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Edsall,
> William (WJ)
> 	Sent: Thursday, February 28, 2008 3:01 PM
> 	To: torqueusers at supercluster.org
> 	Subject: [torqueusers] Fluent Infiniband jobs fail, only in PBS
> 
> 	Hello!
> 	  I'm experiencing a strange issue with PBS and the application
> Fluent. We experienced a power outage, and have since had trouble
> running Fluent through PBS over Infiniband.
> 
> 	- Fluent runs fine through PBS on Ethernet
> 	- Fluent runs fine outside of PBS on Infiniband
> 	- Fluent only fails when run through PBS, over Infiniband
> 
> 	Any suggestions? Even if I run PBS with the -I switch, I can't
> run Fluent successfully over Infiniband. Something environmentally
> changed by PBS is causing MPI to fail. My PBS version is 2.1.7.
> 
> 	Here is the failure result of a job on 2 nodes running through
> PBS:
> 	Host spawning Node 0 on machine "node30" (unix).
> 	/apps/fluent/Fluent.Inc/fluent6.3.35/bin/fluent -r6.3.35 3ddp
> -node -t16 -pib -mpi=hp -cnf=/home/u396929/fluent_test/nodes2 -mport
> 192.168.0.30:192.168.0.30:46683:0
> 
> 	Starting
> /apps/fluent/Fluent.Inc/fluent6.3.35/multiport/mpi/lnamd64/hp/bin/mpirun
> -prot -IBV -e MPI_HASIC_IBV=1 -f /tmp/fluent-appfile.8049
> 
> 	fluent_mpi.6.3.35: Rank 0:4: MPI_Init: ibv_create_qp() failed
> 	fluent_mpi.6.3.35: Rank 0:4: MPI_Init: probably you need to
> increase pinnable memory in /etc/security/limits.conf
> 	fluent_mpi.6.3.35: Rank 0:4: MPI_Init: Can't initialize RDMA
> device
> 	fluent_mpi.6.3.35: Rank 0:4: MPI_Init: MPI BUG: Cannot
> initialize RDMA protocol
> 	MPI Application rank 4 exited before MPI_Init() with status 1
> 
> 
> 	Here is the success result from the command line (same nodes):
> 	Host spawning Node 0 on machine "node30" (unix).
> 	/apps/fluent/Fluent.Inc/fluent6.3.35/bin/fluent -r6.3.35 3ddp
> -node -t16 -pib -mpi=hp -cnf=/home/u396929/fluent_test/nodes2 -mport
> 192.168.0.30:192.168.0.30:43334:0
> 
> 	Starting
> /apps/fluent/Fluent.Inc/fluent6.3.35/multiport/mpi/lnamd64/hp/bin/mpirun
> -prot -IBV -e MPI_HASIC_IBV=1 -f /tmp/fluent-appfile.8871
> 
> 	HP-MPI licensed for Fluent.
> 	Host 0 -- ip 192.168.0.30 -- ranks 0 - 7
> 	Host 1 -- ip 192.168.0.31 -- ranks 8 - 15
> 
> 	Please help! Thanks much!
> 
> 	William
> 
> 



