[torqueusers] Fluent Infiniband jobs fail, only in PBS
James J Coyle
jjc at iastate.edu
Fri Feb 29 09:10:01 MST 2008
William,
We run Fluent here through PBS also.
I'm not sure I understood your fix.
The line
ulimit -l unlimited
appears to be a (Bourne) shell command, but pbs_mom is an executable
built from a C routine.
Do you mean that you placed this command
in the /var/spool/torque/mom_priv/prologue file?
Somewhere else?
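For anyone comparing the two environments in the meantime, here is a minimal sketch of how the locked-memory limit could be checked from a login shell and again from inside a PBS job (for example under qsub -I). This is only my assumption about a useful check, not William's actual fix: the function names and the 65536 kB threshold are hypothetical, and the right value depends on the InfiniBand stack.

```shell
#!/bin/bash
# Print the locked-memory (memlock) soft limit inherited by this shell.
# bash reports the value in kilobytes, or the word "unlimited".
memlock_limit() {
    ulimit -l
}

# Succeed when the limit is "unlimited" or at least $1 kilobytes.
# The threshold passed in is a hypothetical value chosen by the caller.
memlock_at_least() {
    local limit
    limit=$(memlock_limit)
    [ "$limit" = "unlimited" ] && return 0
    [ "$limit" -ge "$1" ]
}

echo "memlock limit: $(memlock_limit)"
if memlock_at_least 65536; then
    echo "memlock limit looks sufficient"
else
    echo "memlock limit may be too low for InfiniBand"
fi
```

Resource limits are inherited by child processes, so a job started by pbs_mom gets whatever limit pbs_mom itself had when it was launched. If the value printed inside a job differs from the one printed in a login shell, that would be consistent with the 32k limit William describes.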
Thanks,
- Jim Coyle
> Hello again,
> I've answered my own question but wanted to follow up with everyone and
> share the result.
>
> Basically pbs_mom was being memorylocked to 32k, so this explained the
> strange behavior of jobs run through PBS.
>
> The resolution was adding the following line to pbs_mom:
> ulimit -l unlimited
>
> And of course, restarting pbs_mom.
>
>
> ________________________________
>
> From: torqueusers-bounces at supercluster.org
> [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Edsall,
> William (WJ)
> Sent: Thursday, February 28, 2008 3:01 PM
> To: torqueusers at supercluster.org
> Subject: [torqueusers] Fluent Infiniband jobs fail, only in PBS
>
> Hello!
> I'm experiencing a strange issue with PBS and the application
> Fluent. We experienced a power outage, and have since had trouble
> running Fluent through PBS over Infiniband.
>
> - Fluent runs fine through PBS on Ethernet
> - Fluent runs fine outside of PBS on Infiniband
> - Fluent only fails when run through PBS, over Infiniband
>
> Any suggestions? Even if I run PBS with the -I switch, I can't
> run Fluent successfully over Infiniband. Something in the environment
> that PBS changes is causing MPI to fail. My PBS version is 2.1.7.
>
> Here is the failure result of a job on 2 nodes running through PBS:
> Host spawning Node 0 on machine "node30" (unix).
> /apps/fluent/Fluent.Inc/fluent6.3.35/bin/fluent -r6.3.35 3ddp
> -node -t16 -pib -mpi=hp -cnf=/home/u396929/fluent_test/nodes2 -mport
> 192.168.0.30:192.168.0.30:46683:0
>
> Starting
> /apps/fluent/Fluent.Inc/fluent6.3.35/multiport/mpi/lnamd64/hp/bin/mpirun
> -prot -IBV -e MPI_HASIC_IBV=1 -f /tmp/fluent-appfile.8049
>
> fluent_mpi.6.3.35: Rank 0:4: MPI_Init: ibv_create_qp() failed
> fluent_mpi.6.3.35: Rank 0:4: MPI_Init: probably you need to increase pinnable memory in /etc/security/limits.conf
> fluent_mpi.6.3.35: Rank 0:4: MPI_Init: Can't initialize RDMA device
> fluent_mpi.6.3.35: Rank 0:4: MPI_Init: MPI BUG: Cannot initialize RDMA protocol
> MPI Application rank 4 exited before MPI_Init() with status 1
>
>
> Here is the success result from the command line (same nodes):
> Host spawning Node 0 on machine "node30" (unix).
> /apps/fluent/Fluent.Inc/fluent6.3.35/bin/fluent -r6.3.35 3ddp
> -node -t16 -pib -mpi=hp -cnf=/home/u396929/fluent_test/nodes2 -mport
> 192.168.0.30:192.168.0.30:43334:0
>
> Starting
> /apps/fluent/Fluent.Inc/fluent6.3.35/multiport/mpi/lnamd64/hp/bin/mpirun
> -prot -IBV -e MPI_HASIC_IBV=1 -f /tmp/fluent-appfile.8871
>
> HP-MPI licensed for Fluent.
> Host 0 -- ip 192.168.0.30 -- ranks 0 - 7
> Host 1 -- ip 192.168.0.31 -- ranks 8 - 15
>
> Please help! Thanks much!
>
> William
>
>