[torqueusers] Fluent Infiniband jobs fail, only in PBS

Edsall, William (WJ) WJEdsall at dow.com
Thu Feb 28 13:01:19 MST 2008


Hello!
  I'm experiencing a strange issue with PBS and the application Fluent.
We had a power outage, and since then we have had trouble running Fluent
through PBS over InfiniBand.

- Fluent runs fine through PBS over Ethernet
- Fluent runs fine outside of PBS over InfiniBand
- Fluent fails only when run through PBS over InfiniBand

Any suggestions? Even if I submit an interactive job (qsub -I), I can't
run Fluent successfully over InfiniBand. Something in the environment
set up by PBS seems to be causing MPI to fail. My PBS (Torque) version
is 2.1.7.
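In case it helps narrow things down, one quick check (just a rough
sketch, assuming bash on the compute nodes) would be to compare the
locked-memory limit a process sees under PBS with what it sees from a
plain ssh login:

  # Outside of PBS, directly on the node:
  ssh node30 'ulimit -l'

  # Inside an interactive PBS job on the same node:
  qsub -I -l nodes=node30
  ulimit -l        # run at the job prompt and compare with the value above

  # A small value under PBS (often 32 or 64 KB by default) while ssh
  # reports "unlimited" would explain an ibv_create_qp() failure.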

Here is the output from a failing job on 2 nodes, run through PBS:
Host spawning Node 0 on machine "node30" (unix).
/apps/fluent/Fluent.Inc/fluent6.3.35/bin/fluent -r6.3.35 3ddp -node -t16
-pib -mpi=hp -cnf=/home/u396929/fluent_test/nodes2 -mport
192.168.0.30:192.168.0.30:46683:0
Starting
/apps/fluent/Fluent.Inc/fluent6.3.35/multiport/mpi/lnamd64/hp/bin/mpirun
-prot -IBV -e MPI_HASIC_IBV=1 -f /tmp/fluent-appfile.8049
fluent_mpi.6.3.35: Rank 0:4: MPI_Init: ibv_create_qp() failed
fluent_mpi.6.3.35: Rank 0:4: MPI_Init: probably you need to increase
pinnable memory in /etc/security/limits.conf
fluent_mpi.6.3.35: Rank 0:4: MPI_Init: Can't initialize RDMA device
fluent_mpi.6.3.35: Rank 0:4: MPI_Init: MPI BUG: Cannot initialize RDMA
protocol
MPI Application rank 4 exited before MPI_Init() with status 1
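From that error I am guessing at the usual fix: raise the locked-memory
(memlock) limit on the compute nodes and restart pbs_mom so jobs inherit
it, since the power outage probably left the MOMs running with the
default limit. Is something like the sketch below the right direction?
(The limits.conf values and the init script path are my assumptions,
nothing I have verified yet.)

  # /etc/security/limits.conf on each compute node (adjust to site policy):
  *    soft    memlock    unlimited
  *    hard    memlock    unlimited

  # limits.conf is applied by PAM at login, so restart pbs_mom from a
  # fresh root login shell (or set "ulimit -l unlimited" in its init
  # script) so the daemon, and the jobs it spawns, inherit the new limit.
  # The init script path may differ by distribution:
  ulimit -l unlimited
  /etc/init.d/pbs_mom restart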


Here is the successful output from the command line (same nodes):
Host spawning Node 0 on machine "node30" (unix).
/apps/fluent/Fluent.Inc/fluent6.3.35/bin/fluent -r6.3.35 3ddp -node -t16
-pib -mpi=hp -cnf=/home/u396929/fluent_test/nodes2 -mport
192.168.0.30:192.168.0.30:43334:0
Starting
/apps/fluent/Fluent.Inc/fluent6.3.35/multiport/mpi/lnamd64/hp/bin/mpirun
-prot -IBV -e MPI_HASIC_IBV=1 -f /tmp/fluent-appfile.8871
HP-MPI licensed for Fluent.
Host 0 -- ip 192.168.0.30 -- ranks 0 - 7
Host 1 -- ip 192.168.0.31 -- ranks 8 - 15

Please help! Thanks much!

William
