[torqueusers] Fluent Infiniband jobs fail, only in PBS
Edsall, William (WJ)
WJEdsall at dow.com
Fri Feb 29 08:48:56 MST 2008
Hello again,
I've answered my own question but wanted to follow up with everyone and
share the result.
Basically, pbs_mom was running with a locked-memory (memlock) limit of
32k, and the jobs it starts inherit that limit. That explains the
strange behavior of jobs run through PBS.
The resolution was adding the following line to the pbs_mom startup
script:
ulimit -l unlimited
And of course, restarting pbs_mom.
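For anyone who wants to check for the same condition, here is a rough sketch (not something from my actual setup) that reports the locked-memory limit of the current shell and, on Linux systems that expose /proc/<pid>/limits, of a running pbs_mom. The function name classify_memlock is made up for illustration, and treating anything other than "unlimited" as suspect is an assumption.

```shell
#!/bin/bash
# Classify a locked-memory limit as printed by `ulimit -l`
# (a number of kilobytes, or the word "unlimited").
# classify_memlock is a hypothetical helper name, not part of PBS/TORQUE.
classify_memlock() {
    case "$1" in
        unlimited)    echo "ok: unlimited" ;;
        ''|*[!0-9]*)  echo "unknown: $1" ;;
        *)            echo "limited: $1 KB" ;;
    esac
}

# Limit seen by the shell running this script.
classify_memlock "$(ulimit -l)"

# Limit of the running pbs_mom daemon, if one is found.
# Assumes Linux with pgrep available and a kernel that provides
# /proc/<pid>/limits; otherwise this part prints nothing.
pid=$(pgrep -x pbs_mom 2>/dev/null | head -n 1)
if [ -n "$pid" ] && [ -r "/proc/$pid/limits" ]; then
    grep 'Max locked memory' "/proc/$pid/limits"
fi
```

Running it once from a login shell and once inside a job (an interactive job works) should show whether the two environments differ. In my case the job side would have reported 32.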
________________________________
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Edsall,
William (WJ)
Sent: Thursday, February 28, 2008 3:01 PM
To: torqueusers at supercluster.org
Subject: [torqueusers] Fluent Infiniband jobs fail, only in PBS
Hello!
I'm experiencing a strange issue with PBS and the application
Fluent. We experienced a power outage, and have since had trouble
running Fluent through PBS over Infiniband.
- Fluent runs fine through PBS on Ethernet
- Fluent runs fine outside of PBS on Infiniband
- Fluent only fails when run through PBS, over Infiniband
Any suggestions? Even in an interactive job (the -I switch), I can't
run Fluent successfully over Infiniband. Something PBS changes in the
environment is causing MPI to fail. My PBS version is 2.1.7.
Here is the output of a failed job on 2 nodes run through PBS:
Host spawning Node 0 on machine "node30" (unix).
/apps/fluent/Fluent.Inc/fluent6.3.35/bin/fluent -r6.3.35 3ddp -node -t16 -pib -mpi=hp -cnf=/home/u396929/fluent_test/nodes2 -mport 192.168.0.30:192.168.0.30:46683:0
Starting /apps/fluent/Fluent.Inc/fluent6.3.35/multiport/mpi/lnamd64/hp/bin/mpirun -prot -IBV -e MPI_HASIC_IBV=1 -f /tmp/fluent-appfile.8049
fluent_mpi.6.3.35: Rank 0:4: MPI_Init: ibv_create_qp() failed
fluent_mpi.6.3.35: Rank 0:4: MPI_Init: probably you need to increase pinnable memory in /etc/security/limits.conf
fluent_mpi.6.3.35: Rank 0:4: MPI_Init: Can't initialize RDMA device
fluent_mpi.6.3.35: Rank 0:4: MPI_Init: MPI BUG: Cannot initialize RDMA protocol
MPI Application rank 4 exited before MPI_Init() with status 1
Here is the output of a successful run from the command line (same
nodes):
Host spawning Node 0 on machine "node30" (unix).
/apps/fluent/Fluent.Inc/fluent6.3.35/bin/fluent -r6.3.35 3ddp -node -t16 -pib -mpi=hp -cnf=/home/u396929/fluent_test/nodes2 -mport 192.168.0.30:192.168.0.30:43334:0
Starting /apps/fluent/Fluent.Inc/fluent6.3.35/multiport/mpi/lnamd64/hp/bin/mpirun -prot -IBV -e MPI_HASIC_IBV=1 -f /tmp/fluent-appfile.8871
HP-MPI licensed for Fluent.
Host 0 -- ip 192.168.0.30 -- ranks 0 - 7
Host 1 -- ip 192.168.0.31 -- ranks 8 - 15
Please help! Thanks much!
William