[torqueusers] Fluent Infiniband jobs fail, only in PBS

James J Coyle jjc at iastate.edu
Fri Feb 29 09:54:14 MST 2008


William,

  Thanks for the explanation.

  I had done this in the 
/etc/security/limits.conf
file, so I should be OK.

  Raising the limit there raises it for all processes.
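
  For anyone else following along, the entries in question are the
memlock lines; something like the following (values are site-specific)
raises the locked-memory limit for every user on the node:

    *    soft    memlock    unlimited
    *    hard    memlock    unlimited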

Thanks,
 - Jim C.

> Jim,
>  We placed the ulimit command into our pbs_mom startup script to execute
> before mom starts.
> 
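A minimal sketch of that approach, assuming a SysV-style init script
(the exact script name and path vary by install):

    # near the top of the pbs_mom start script, before the daemon is launched
    ulimit -l unlimited

Every job pbs_mom spawns then inherits the raised locked-memory limit.
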
> To see if you're running into the same issue, try running an interactive
> qsub job on one node and typing 'limit' on the node you're sent to, or
> adding a 'limit' command to your pbs script.
> 
> I was seeing the following:
> cputime      unlimited
> filesize     unlimited
> datasize     unlimited
> stacksize    8192 kbytes
> coredumpsize 0 kbytes
> memoryuse    unlimited
> vmemoryuse   19096560 kbytes
> descriptors  32768
> memorylocked 32 kbytes -  bad!
> maxproc      73728
> 
> After setting the locked limit to unlimited you should see something
> like this:
> cputime      unlimited
> filesize     unlimited
> datasize     unlimited
> stacksize    8192 kbytes
> coredumpsize 0 kbytes
> memoryuse    unlimited
> vmemoryuse   19096560 kbytes
> descriptors  1024
> memorylocked 7500000 kbytes
> maxproc      73728
>  
> 
> The memorylocked limit will only show its face if you check the limit
> command from inside a pbs job.
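
A quick way to confirm what a job actually sees, assuming a bash job
shell ('limit' above is the csh spelling of 'ulimit -a'):

    qsub -I -l nodes=1
    # then, on the node the job lands on:
    ulimit -l    # max locked memory in kbytes; 32 here means trouble
    ulimit -a    # the full listing, analogous to the 'limit' output above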
> 
> -----Original Message-----
> From: James J Coyle [mailto:jjc at iastate.edu] 
> Sent: Friday, February 29, 2008 11:10 AM
> To: Edsall, William (WJ)
> Cc: torqueusers at supercluster.org
> Subject: Re: [torqueusers] Fluent Infiniband jobs fail, only in PBS 
> 
> William,
> 
>    We run Fluent here through PBS also.
> 
>    I'm not sure I understood your fix.
> 
>    The line
> 
>   ulimit -l unlimited
> 
> appears to be a (Bourne) shell command, but pbs_mom is an executable
> built from a C routine.
> 
>   Do you mean that you placed this command
> in the /var/spool/torque/mom_priv/prologue file?
> 
>   Somewhere else?
> 
> Thanks,
>  - Jim Coyle
> 
> 
> > Hello again,
> >  I've answered my own question but wanted to follow up with everyone
> > and share the result.
> > 
> > Basically pbs_mom was being memorylocked to 32k, so this explained the
> > strange behavior of jobs run through PBS.
> > 
> > The resolution was adding the following line to pbs_mom:
> > ulimit -l unlimited
> > 
> > And of course, restarting pbs_mom.
> > 
> > 
> > ________________________________
> > 
> > 	From: torqueusers-bounces at supercluster.org
> > [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Edsall,
> > William (WJ)
> > 	Sent: Thursday, February 28, 2008 3:01 PM
> > 	To: torqueusers at supercluster.org
> > 	Subject: [torqueusers] Fluent Infiniband jobs fail, only in PBS
> > 
> > 	Hello!
> > 	  I'm experiencing a strange issue with PBS and the application
> > Fluent. We experienced a power outage, and have since had trouble
> > running Fluent through PBS over Infiniband.
> > 
> > 	- Fluent runs fine through PBS on Ethernet
> > 	- Fluent runs fine outside of PBS on Infiniband
> > 	- Fluent only fails when run through PBS, over Infiniband
> > 
> > 	Any suggestions? Even if I run PBS with the -I switch, I can't
> > run Fluent successfully over Infiniband. Something environmentally
> > changed by PBS is causing MPI to fail. My PBS version is 2.1.7.
> > 
> > 	Here is the failure result of a job on 2 nodes running through
> > PBS:
> > 	Host spawning Node 0 on machine "node30" (unix).
> > 	/apps/fluent/Fluent.Inc/fluent6.3.35/bin/fluent -r6.3.35 3ddp
> > -node -t16 -pib -mpi=hp -cnf=/home/u396929/fluent_test/nodes2 -mport
> > 192.168.0.30:192.168.0.30:46683:0
> > 
> > 	Starting
> > /apps/fluent/Fluent.Inc/fluent6.3.35/multiport/mpi/lnamd64/hp/bin/mpirun
> > -prot -IBV -e MPI_HASIC_IBV=1 -f /tmp/fluent-appfile.8049
> > 
> > 	fluent_mpi.6.3.35: Rank 0:4: MPI_Init: ibv_create_qp() failed
> > 	fluent_mpi.6.3.35: Rank 0:4: MPI_Init: probably you need to
> > increase pinnable memory in /etc/security/limits.conf
> > 	fluent_mpi.6.3.35: Rank 0:4: MPI_Init: Can't initialize RDMA
> > device
> > 	fluent_mpi.6.3.35: Rank 0:4: MPI_Init: MPI BUG: Cannot
> > initialize RDMA protocol
> > 	MPI Application rank 4 exited before MPI_Init() with status 1
> > 
> > 
> > 	Here is the success result from the command line (same nodes):
> > 	Host spawning Node 0 on machine "node30" (unix).
> > 	/apps/fluent/Fluent.Inc/fluent6.3.35/bin/fluent -r6.3.35 3ddp
> > -node -t16 -pib -mpi=hp -cnf=/home/u396929/fluent_test/nodes2 -mport
> > 192.168.0.30:192.168.0.30:43334:0
> > 
> > 	Starting
> > /apps/fluent/Fluent.Inc/fluent6.3.35/multiport/mpi/lnamd64/hp/bin/mpirun
> > -prot -IBV -e MPI_HASIC_IBV=1 -f /tmp/fluent-appfile.8871
> > 
> > 	HP-MPI licensed for Fluent.
> > 	Host 0 -- ip 192.168.0.30 -- ranks 0 - 7
> > 	Host 1 -- ip 192.168.0.31 -- ranks 8 - 15
> > 
> > 	Please help! Thanks much!
> > 
> > 	William
> 
> 







More information about the torqueusers mailing list