[torqueusers] Fluent Infiniband jobs fail, only in PBS

Edsall, William (WJ) WJEdsall at dow.com
Fri Feb 29 08:48:56 MST 2008

Hello again,
I've answered my own question but wanted to follow up with everyone and
share the result.
pbs_mom was running with its locked-memory limit (ulimit -l) set to 32k.
Jobs started by pbs_mom inherit its limits, which explains the strange
behavior of jobs run through PBS.
The resolution was to add the following line to the pbs_mom startup script:
ulimit -l unlimited
and, of course, to restart pbs_mom.
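
For anyone who wants to check for the same condition, here is a minimal
sketch that reports the locked-memory limit of the shell it runs in. The
function name describe_memlock is only an illustration and is not part of
TORQUE or Fluent; the sketch assumes a shell whose ulimit builtin supports
-l and reports the value in KB.

```shell
#!/bin/sh
# Turn the value printed by `ulimit -l` (a number in KB, or "unlimited")
# into a short human-readable description.
# describe_memlock is a hypothetical helper name used for illustration.
describe_memlock() {
    case "$1" in
        unlimited) echo "unlimited" ;;
        *)         echo "limited to $1 KB" ;;
    esac
}

# Report the limit in effect for the current shell.
current="$(ulimit -l)"
echo "max locked memory on $(hostname): $(describe_memlock "$current")"
```

Running it once from a login shell on a compute node and once from inside
a PBS job on the same node should show whether the two environments differ.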


	From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Edsall,
William (WJ)
	Sent: Thursday, February 28, 2008 3:01 PM
	To: torqueusers at supercluster.org
	Subject: [torqueusers] Fluent Infiniband jobs fail, only in PBS

	  I'm experiencing a strange issue with PBS and the application
Fluent. We experienced a power outage, and have since had trouble
running Fluent through PBS over Infiniband.

	- Fluent runs fine through PBS on Ethernet 
	- Fluent runs fine outside of PBS on Infiniband 
	- Fluent only fails when run through PBS, over Infiniband 

	Any suggestions? Even in an interactive job (qsub -I), I can't
run Fluent successfully over Infiniband. Something in the environment
set up by PBS is causing MPI to fail. My PBS version is 2.1.7.

	Here is the failure output of a job on 2 nodes running through PBS:
	Host spawning Node 0 on machine "node30" (unix). 
	/apps/fluent/Fluent.Inc/fluent6.3.35/bin/fluent -r6.3.35 3ddp
-node -t16 -pib -mpi=hp -cnf=/home/u396929/fluent_test/nodes2 -mport

-prot -IBV -e MPI_HASIC_IBV=1 -f /tmp/fluent-appfile.8049

	fluent_mpi.6.3.35: Rank 0:4: MPI_Init: ibv_create_qp() failed 
	fluent_mpi.6.3.35: Rank 0:4: MPI_Init: probably you need to
increase pinnable memory in /etc/security/limits.conf 
	fluent_mpi.6.3.35: Rank 0:4: MPI_Init: Can't initialize RDMA
	fluent_mpi.6.3.35: Rank 0:4: MPI_Init: MPI BUG: Cannot
initialize RDMA protocol 
	MPI Application rank 4 exited before MPI_Init() with status 1 

	Here is the successful output from the command line (same nodes): 
	Host spawning Node 0 on machine "node30" (unix). 
	/apps/fluent/Fluent.Inc/fluent6.3.35/bin/fluent -r6.3.35 3ddp
-node -t16 -pib -mpi=hp -cnf=/home/u396929/fluent_test/nodes2 -mport

-prot -IBV -e MPI_HASIC_IBV=1 -f /tmp/fluent-appfile.8871

	HP-MPI licensed for Fluent. 
	Host 0 -- ip -- ranks 0 - 7 
	Host 1 -- ip -- ranks 8 - 15 

	Please help! Thanks much! 

