[torqueusers] Fluent Infiniband jobs fail, only in PBS
James J Coyle
jjc at iastate.edu
Fri Feb 29 09:54:14 MST 2008
William,
Thanks for the explanation.
I had already done this in the /etc/security/limits.conf file, so I
should be OK. Raising the limit there ups the limit for all processes.
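
As a rough illustration, the memlock entries in a limits.conf-style file can be listed as sketched below. The helper name memlock_entries and the sample values are made up for the example and are not an actual site configuration. limits.conf is read by pam_limits, so whether a change there reaches a daemon started at boot may depend on how that daemon is launched.

```shell
#!/bin/bash
# Hypothetical helper: print the memlock entries from a limits.conf-style file.
# Each entry has the form: <domain> <type> <item> <value>
memlock_entries() {
    # Skip comment lines; keep lines whose third field is "memlock".
    awk '$1 !~ /^#/ && $3 == "memlock" { print $1, $2, $4 }' "$1"
}

# Example with made-up sample content (not a real site configuration).
sample=$(mktemp)
cat > "$sample" <<'EOF'
# <domain> <type> <item> <value>
*          soft   memlock   unlimited
*          hard   memlock   unlimited
*          soft   nofile    1024
EOF
memlock_entries "$sample"
rm -f "$sample"
```

On a real node the function would be pointed at /etc/security/limits.conf instead of the temporary sample file.
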
Thanks,
- Jim C.
> Jim,
> We placed the ulimit command into our pbs_mom startup script to execute
> before mom starts.
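
A minimal sketch of that approach: raise the limit in the shell that launches the daemon, so the daemon and the jobs it spawns inherit it. The name start_with_memlock is hypothetical, and the pbs_mom path in the usage comment is an assumption. Raising the limit above the current hard limit normally requires root.

```shell
#!/bin/bash
# Hypothetical wrapper: set the locked-memory limit, then run a command under it.
start_with_memlock() {
    local limit="$1"
    shift
    (
        # Runs in a subshell so the caller's own limit is left untouched.
        ulimit -l "$limit" || exit 1
        exec "$@"
    )
}

# In a startup script this might look like (the path is an assumption):
#   start_with_memlock unlimited /usr/sbin/pbs_mom
```
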
>
> To see if you're running into the same issue, try running an interactive
> qsub job on one node and typing 'limit' on the node you're sent to, or
> adding a 'limit' command to your pbs script.
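
A minimal sketch of such a check script, assuming the job runs under bash (where the builtin is 'ulimit' instead of csh's 'limit'). The job name and the report_memlock helper are made up for illustration.

```shell
#!/bin/bash
#PBS -N limitcheck
#PBS -l nodes=1

# Hypothetical helper: report which node we landed on and its locked-memory limit.
report_memlock() {
    echo "host: $(uname -n)"
    echo "memorylocked (kbytes): $(ulimit -l)"
}

report_memlock
```

Submitted with qsub, the job's output file should show the limit that jobs actually inherit on that node.
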
>
> I was seeing the following:
> cputime unlimited
> filesize unlimited
> datasize unlimited
> stacksize 8192 kbytes
> coredumpsize 0 kbytes
> memoryuse unlimited
> vmemoryuse 19096560 kbytes
> descriptors 32768
> memorylocked 32 kbytes - bad!
> maxproc 73728
>
> After setting the locked limit to unlimited you should see something
> like this:
> cputime unlimited
> filesize unlimited
> datasize unlimited
> stacksize 8192 kbytes
> coredumpsize 0 kbytes
> memoryuse unlimited
> vmemoryuse 19096560 kbytes
> descriptors 1024
> memorylocked 7500000 kbytes
> maxproc 73728
>
>
> The low memorylocked limit only shows up if you run the limit command
> from inside a PBS job.
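
Since jobs inherit their limits from pbs_mom, another way to check, on kernels that expose /proc/<pid>/limits, is to read the daemon's own limits directly. The sketch below works under that assumption; memlock_from_limits is a hypothetical helper and the pgrep call in the comment is illustrative. That file reports locked memory in bytes, not kbytes.

```shell
#!/bin/bash
# Hypothetical helper: print the soft and hard "Max locked memory" values
# from a file in /proc/<pid>/limits format.
memlock_from_limits() {
    awk '/^Max locked memory/ { print "soft=" $4, "hard=" $5 }' "$1"
}

# Illustrative use against the running daemon (assumes pgrep is available):
#   memlock_from_limits "/proc/$(pgrep -o pbs_mom)/limits"
```
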
>
> -----Original Message-----
> From: James J Coyle [mailto:jjc at iastate.edu]
> Sent: Friday, February 29, 2008 11:10 AM
> To: Edsall, William (WJ)
> Cc: torqueusers at supercluster.org
> Subject: Re: [torqueusers] Fluent Infiniband jobs fail, only in PBS
>
> William,
>
> We run Fluent here through PBS also.
>
> I'm not sure I understood your fix.
>
> The line
>
> ulimit -l unlimited
>
> appears to be a (Bourne) shell command, but pbs_mom is an executable
> built from a C routine.
>
> Do you mean that you placed this command
> in the /var/spool/torque/mom_priv/prologue file?
>
> Somewhere else?
>
> Thanks,
> - Jim Coyle
>
>
> > Hello again,
> > I've answered my own question but wanted to follow up with everyone
> > and share the result.
> >
> > Basically, pbs_mom was running with a memorylocked limit of 32k, which
> > explains the strange behavior of jobs run through PBS.
> >
> > The resolution was adding the following line to pbs_mom:
> > ulimit -l unlimited
> >
> > And of course, restarting pbs_mom.
> >
> >
> > ________________________________
> >
> > From: torqueusers-bounces at supercluster.org
> > [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Edsall,
> > William (WJ)
> > Sent: Thursday, February 28, 2008 3:01 PM
> > To: torqueusers at supercluster.org
> > Subject: [torqueusers] Fluent Infiniband jobs fail, only in PBS
> >
> > Hello!
> > I'm experiencing a strange issue with PBS and the application
> > Fluent. We experienced a power outage, and have since had trouble
> > running Fluent through PBS over Infiniband.
> >
> > - Fluent runs fine through PBS on Ethernet
> > - Fluent runs fine outside of PBS on Infiniband
> > - Fluent only fails when run through PBS, over Infiniband
> >
> > Any suggestions? Even if I run PBS with the -I switch, I can't
> > run Fluent successfully over Infiniband. Something PBS changes in the
> > environment is causing MPI to fail. My PBS version is 2.1.7.
> >
> > Here is the failure result of a job on 2 nodes running through PBS:
> > Host spawning Node 0 on machine "node30" (unix).
> > /apps/fluent/Fluent.Inc/fluent6.3.35/bin/fluent -r6.3.35 3ddp
> > -node -t16 -pib -mpi=hp -cnf=/home/u396929/fluent_test/nodes2 -mport
> > 192.168.0.30:192.168.0.30:46683:0
> >
> > Starting
> > /apps/fluent/Fluent.Inc/fluent6.3.35/multiport/mpi/lnamd64/hp/bin/mpirun
> > -prot -IBV -e MPI_HASIC_IBV=1 -f /tmp/fluent-appfile.8049
> >
> > fluent_mpi.6.3.35: Rank 0:4: MPI_Init: ibv_create_qp() failed
> > fluent_mpi.6.3.35: Rank 0:4: MPI_Init: probably you need to
> > increase pinnable memory in /etc/security/limits.conf
> > fluent_mpi.6.3.35: Rank 0:4: MPI_Init: Can't initialize RDMA
> > device
> > fluent_mpi.6.3.35: Rank 0:4: MPI_Init: MPI BUG: Cannot
> > initialize RDMA protocol
> > MPI Application rank 4 exited before MPI_Init() with status 1
> >
> >
> > Here is the success result from the command line (same nodes):
> > Host spawning Node 0 on machine "node30" (unix).
> > /apps/fluent/Fluent.Inc/fluent6.3.35/bin/fluent -r6.3.35 3ddp
> > -node -t16 -pib -mpi=hp -cnf=/home/u396929/fluent_test/nodes2 -mport
> > 192.168.0.30:192.168.0.30:43334:0
> >
> > Starting
> > /apps/fluent/Fluent.Inc/fluent6.3.35/multiport/mpi/lnamd64/hp/bin/mpirun
> > -prot -IBV -e MPI_HASIC_IBV=1 -f /tmp/fluent-appfile.8871
> >
> > HP-MPI licensed for Fluent.
> > Host 0 -- ip 192.168.0.30 -- ranks 0 - 7
> > Host 1 -- ip 192.168.0.31 -- ranks 8 - 15
> >
> > Please help! Thanks much!
> >
> > William
> >
> >
>
>