[torqueusers] Torque environment problem

Svancara, Randall rsvancara at wsu.edu
Fri Mar 18 21:36:59 MDT 2011


I just wanted to add that if I launch a job on one node, everything works fine.  For example in my job script if I specify


#PBS -l nodes=1:ppn=12

Then everything runs fine.


However, if I specify two nodes, then everything fails:


#PBS -l nodes=2:ppn=12

This also fails:


#PBS -l nodes=13

But this does not:


#PBS -l nodes=12
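
For completeness, the job script behind these tests is essentially just the resource request (varied as above) plus the launch line; roughly, with placeholder paths and names, so treat this as a sketch rather than my exact script:

#!/bin/bash
#PBS -l nodes=2:ppn=12
#PBS -l walltime=00:05:00
#PBS -N mpitest

cd $PBS_O_WORKDIR
# with a TM-aware Open MPI build, mpirun takes the node list and process
# count from the Torque allocation, so no -np or -hostfile is given here
/usr/mpi/intel/openmpi-1.4.3/bin/mpirun ./mpitest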

Thanks,

Randall

-----Original Message-----
From: torqueusers-bounces at supercluster.org on behalf of Svancara, Randall
Sent: Fri 3/18/2011 7:48 PM
To: torqueusers at supercluster.org
Subject: [torqueusers] Torque environment problem
 

Hi,

We are in the process of setting up a new cluster. One issue I am experiencing is with Open MPI jobs launched through Torque.

When I launch a simple job using a very basic MPI "Hello World" program, I see the following errors from Open MPI:

**************************

[node164:06689] plm:tm: failed to poll for a spawned daemon, return status = 17002
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
        node163 - daemon did not report back when launched
Completed executing:

*************************

However, when I launch the job directly with mpirun and a hostfile, everything seems to work fine:

/usr/mpi/intel/openmpi-1.4.3/bin/mpirun -hostfile /home/admins/rsvancara/hosts -n 24 /home/admins/rsvancara/TEST/mpitest

The job runs 24 processes, 12 per node.
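
Since the hostfile launch bypasses Torque's TM launcher while the failing launch goes through it, one thing worth confirming (a sketch, assuming this is the same Open MPI install) is that TM support is actually compiled into this build:

/usr/mpi/intel/openmpi-1.4.3/bin/ompi_info | grep tm
# a TM-enabled build should list components such as "MCA plm: tm" and "MCA ras: tm"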

I have verified that my .bashrc is working. I have tried to launch from an interactive job using qsub -I -l nodes=12:ppn=12 without any success. I am assuming this is an environment problem; however, I am unsure, since the Open MPI error message only says this "may" be the cause.
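
Since the daemon that fails to start is Open MPI's orted, one check I can think of (a sketch; node163 is just the node named in the error output above) is whether orted resolves its shared libraries in a non-interactive shell on the remote node:

# look for unresolved libraries for the launch daemon on the remote node
ssh node163 'ldd /usr/mpi/intel/openmpi-1.4.3/bin/orted | grep "not found"'

# compare the environment a non-interactive shell gets there
ssh node163 'echo LD_LIBRARY_PATH=$LD_LIBRARY_PATH'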

My questions are:

1.  Has anyone had this problem before? (I am sure someone has.)
2.  How would I go about troubleshooting this problem? (Some first checks I plan to try are sketched below.)


I am using Torque version 2.4.7.
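
For (2), these are the checks I plan to start with, from inside a failing multi-node interactive job (a sketch; pbsdsh ships with Torque, and I am assuming /bin/hostname as the test program):

# confirm the allocation really spans multiple nodes with 12 slots each
sort $PBS_NODEFILE | uniq -c

# exercise Torque's TM task launch directly, with Open MPI out of the loop;
# if this fails across nodes, the problem is in Torque/pbs_mom, not Open MPI
pbsdsh /bin/hostname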

Thanks for any assistance anyone can provide.
