[torqueusers] pbs_mom dies on exit of interactive session

Roy Dragseth roy.dragseth at cc.uit.no
Mon Apr 30 14:39:39 MDT 2012

No problem. I should probably have used Bugzilla for this kind of thing.

I'll try to scramble together a more thorough bug report on this matter.

BTW, is 4.0.X supposed to be config-compatible with 3.0.X?  That is, if 3.0.X 
works fine, can I assume the same config will work on 4.0.X?  Would any 
difference in behaviour indicate a bug?   The reason I'm asking is 
that I have a fairly solid setup for v3.0.4 in my torque-roll for Rocks, but 
I'm having a hard time getting 4.0.X working in the same environment.  Being 
pressed for time, I would appreciate a hint on where to look for problems.


On Monday 30. April 2012 09.59.39 David Beer wrote:


I'm sorry your message wasn't attended to - it's the nature of a mailing list 
that messages sometimes get lost. Can you (or Steve) describe the TM connect 
issues in a bit more detail? We should get a bug report filed for this.


On Sun, Apr 29, 2012 at 11:03 AM, Roy Dragseth <roy.dragseth at cc.uit.no> wrote:

I can confirm this too (reported the same problem to the mailing list a month 
ago, but nobody seemed to care). Torque 3.0.4 works fine with exactly the same 
system config. 
On Saturday 28. April 2012 03.21.05 DuChene, StevenX A wrote:

I am running torque-4.0.1, which I pulled from the svn 4.0.1 branch just today.
Earlier today I was running the 4.0-fixes tree from 04/03 and I had the same problems.
I was hoping the update to current sources would fix these problems, but no such luck.
If I run the following:
qsub -I -l nodes=7 -l arch=atomN570
from my pbs job submission host I get:
qsub: waiting for job 4.login2.sep.here to start
qsub: job 4.login2.sep.here ready
and then I get a shell prompt on the node 0 of this job.
If I then do:
$ cat /var/spool/torque/aux//4.login2.sep.here
and then I try:
$ pbsdsh -h atom254 ls /tmp
pbsdsh: error from tm_poll() 17002
Alternatively, if I use the -v option it says:
$ pbsdsh -v -h atom254 /bin/ls /tmp
pbsdsh: tm_init failed, rc = TM_ESYSTEM (17000)
Then when I exit the shell I get back to my job submission node, but when I 
check for the pbs_mom process on node 0 of the job I just had the shell 
prompt on, pbs_mom is no longer running and the init script's status 
message says: 
pbs_mom dead but subsys locked
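For what it's worth, on RHEL-style init scripts that status usually just means the daemon exited without the init script removing its subsys lock file. A sketch of the check (the lock path is assumed from the standard RHEL layout; the snippet falls back to a temp file so it can run anywhere, and the restart line is left commented):

```shell
# Sketch, not TORQUE-specific: "dead but subsys locked" means the
# process is gone but /var/lock/subsys/<name> was left behind.
LOCK=/var/lock/subsys/pbs_mom        # path assumed; check your init script
if [ ! -e "$LOCK" ]; then
    # simulate a stale lock for illustration when not on a real mom node
    LOCK=$(mktemp)
fi
if ! pgrep -x pbs_mom >/dev/null 2>&1 && [ -f "$LOCK" ]; then
    rm -f "$LOCK"                    # clear the stale lock
    echo "stale lock removed"
    # service pbs_mom start         # then restart the mom on a real node
fi
```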
Also, if I run any MPI jobs through torque that stay on cpu cores of one 
system, they seem to run just fine, but for MPI jobs that span separate 
nodes I get the following error each time: 
$ cat script_viking_7nodes.pbs.e12
[viking12.sep.here:12360] plm:tm: failed to poll for a spawned daemon, return 
status = 17002
A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
        viking11.sep.here - daemon did not report back when launched
The PBS job script for this MPI job is very simple:
#PBS -l nodes=7:Viking:ppn=1
#PBS -l arch=xeon
mpirun --machinefile $PBS_NODEFILE /home/sadX/mpi_test/mpi_hello_hostname > 
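Since pbsdsh and multi-node mpirun both fail while single-node runs work, it may be worth sanity-checking what TORQUE actually writes to the nodefile before mpirun launches. A hedged sketch (the hostnames in the sample fallback are hypothetical; inside a real job $PBS_NODEFILE is set by pbs_mom on the mother-superior node, one line per slot):

```shell
# Sketch: inspect the hosts TORQUE hands to the job.
NODEFILE=${PBS_NODEFILE:-./nodefile}
if [ ! -f "$NODEFILE" ]; then
    # sample data so the snippet runs outside a job (hostnames hypothetical)
    printf 'viking11\nviking12\nviking12\n' > "$NODEFILE"
fi
slots=$(grep -c . "$NODEFILE")           # total slots = hosts x ppn
hosts=$(sort -u "$NODEFILE" | grep -c .) # unique hosts
echo "slots=$slots hosts=$hosts"
```

If the slot or host count does not match the -l nodes= request, the problem is upstream of MPI entirely.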
Also, if I ssh to any node in my cluster, create an MPI nodes file by hand, and 
then use that to run like this:
mpirun --machinefile ../nodefile ./mpi_hello_hostname
I get back all of the expected output from all of the nodes listed in the 
manually created nodes file.
Steven DuChene

The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00 
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no 

torqueusers mailing list
torqueusers at supercluster.org


David Beer | Software Engineer
Adaptive Computing


