[torqueusers] pbs_mom dies on exit of interactive session

David Beer dbeer at adaptivecomputing.com
Mon Apr 30 09:59:39 MDT 2012


Roy,

I'm sorry your message wasn't attended to - it's the nature of a mailing
list that sometimes messages are lost. Can you (or Steve) describe the TM
connect issues in a bit more detail? We should get a bug report filed for
this.
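
In the meantime, a tiny TM client run from inside an interactive job would
tell us whether tm_init itself is failing or whether it is the later
spawn/poll exchange. Something along these lines should do (just a sketch
against the public tm.h interface; build it against the tm.h/libtorque from
your Torque install, e.g. the torque-devel package, as paths vary):

#include <stdio.h>
#include <tm.h>   /* Torque task manager (TM) API */

int main(void)
  {
  struct tm_roots roots;
  tm_node_id     *nodes = NULL;
  int             nnodes = 0;
  int             rc;

  rc = tm_init(NULL, &roots);        /* connect to the mother superior mom */
  if (rc != TM_SUCCESS)
    {
    fprintf(stderr, "tm_init failed, rc = %d\n", rc);
    return(1);
    }

  rc = tm_nodeinfo(&nodes, &nnodes); /* ask the mom for the job's node list */
  if (rc != TM_SUCCESS)
    fprintf(stderr, "tm_nodeinfo failed, rc = %d\n", rc);
  else
    printf("tm_init OK, %d node slot(s) in this job\n", nnodes);

  tm_finalize();
  return(0);
  }

Compile with something like "gcc tm_probe.c -I/usr/include/torque -ltorque"
(or -lpbs on older installs; adjust for wherever your headers and libraries
landed) and run it from the shell on node 0 of the interactive job.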

David

On Sun, Apr 29, 2012 at 11:03 AM, Roy Dragseth <roy.dragseth at cc.uit.no> wrote:

>
> I can confirm this too (reported the same problem to the mailing list a
> month ago, but nobody seemed to care). Torque 3.0.4 works fine with exactly
> the same system config.
>
>
>
> r.
>
>
>
>
>
> On Saturday 28. April 2012 03.21.05 DuChene, StevenX A wrote:
>
> I am running torque-4.0.1 that I pulled from the svn 4.0.1 branch just
> today.
>
> Earlier today I was running the 4.0-fixes tree from 04/03 and I had the
> same results.
>
> I was hoping the update to current sources would fix these problems but no
> such luck.
>
>
>
> If I run the following:
>
>
>
> qsub -I -l nodes=7 -l arch=atomN570
>
>
>
> from my pbs job submission host I get:
>
>
>
> qsub: waiting for job 4.login2.sep.here to start
>
> qsub: job 4.login2.sep.here ready
>
>
>
> and then I get a shell prompt on node 0 of this job.
>
>
>
> If I then do:
>
>
>
> $ echo $PBS_NODEFILE
>
> /var/spool/torque/aux//4.login2.sep.here
>
>
>
> And then:
>
>
>
> $ cat /var/spool/torque/aux//4.login2.sep.here
>
> atom255
> atom255
> atom255
> atom255
> atom254
> atom254
> atom254
>
>
>
> and then I try:
>
>
>
> $ pbsdsh -h atom254 ls /tmp
>
> pbsdsh: error from tm_poll() 17002
>
>
>
> Alternatively if I use the -v option it says:
>
>
>
> $ pbsdsh -v -h atom254 /bin/ls /tmp
>
> pbsdsh: tm_init failed, rc = TM_ESYSTEM (17000)
>
>
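> For what it's worth, 17002 appears to be TM_ENOTCONNECTED in tm.h (and
> 17000 is TM_ESYSTEM, as the -v output says), so it looks like the TM client
> loses, or never gets, its connection back to the mom. A stripped-down spawn
> test along these lines (only a sketch against the public TM calls, not the
> actual pbsdsh source) should show whether a plain TM client hits the same
> error from inside the interactive shell:
>
> #include <stdio.h>
> #include <tm.h>   /* Torque task manager (TM) API */
>
> int main(void)
>   {
>   struct tm_roots roots;
>   tm_node_id     *nodes = NULL;
>   int             nnodes = 0;
>   int             rc;
>   int             local_errno = 0;
>   tm_task_id      tid;
>   tm_event_t      spawn_event;
>   tm_event_t      result_event;
>   char           *spawn_argv[] = { "/bin/hostname", NULL };
>
>   if ((tm_init(NULL, &roots) != TM_SUCCESS) ||
>       (tm_nodeinfo(&nodes, &nnodes) != TM_SUCCESS) ||
>       (nnodes < 1))
>     {
>     fprintf(stderr, "cannot talk to the local pbs_mom\n");
>     return(1);
>     }
>
>   /* try to start /bin/hostname on the job's last node slot (atom254 here) */
>   rc = tm_spawn(1, spawn_argv, NULL, nodes[nnodes - 1], &tid, &spawn_event);
>   if (rc != TM_SUCCESS)
>     {
>     fprintf(stderr, "tm_spawn failed, rc = %d\n", rc);
>     return(1);
>     }
>
>   /* wait for the spawn to complete -- the step where pbsdsh reports 17002 */
>   rc = tm_poll(TM_NULL_EVENT, &result_event, 1, &local_errno);
>   if (rc != TM_SUCCESS)
>     fprintf(stderr, "tm_poll failed, rc = %d (tm_errno = %d)\n",
>             rc, local_errno);
>   else
>     printf("spawned a task on node id %d\n", (int)nodes[nnodes - 1]);
>
>   tm_finalize();
>   return(0);
>   }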
>
> Then, when I exit the shell, I get back to my job submission node. But when
> I then check for the pbs_mom process on node 0 of the job I just had the
> shell prompt on, the pbs_mom process is no longer running and the status
> message from the init script says:
>
>
>
> pbs_mom dead but subsys locked
>
>
>
> Also, any MPI jobs I run through Torque whose processes stay on CPU cores
> within one system seem to run just fine, but for MPI jobs that span separate
> nodes I get the following error each time:
>
>
>
> $ cat script_viking_7nodes.pbs.e12
>
> [viking12.sep.here:12360] plm:tm: failed to poll for a spawned daemon,
> return status = 17002
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
>         viking11.sep.here - daemon did not report back when launched
>
>
>
> The PBS job script for this MPI job is very simple:
>
>
>
> #PBS -l nodes=7:Viking:ppn=1
> #PBS -l arch=xeon
>
> mpirun --machinefile $PBS_NODEFILE /home/sadX/mpi_test/mpi_hello_hostname > hello_out_viking7nodes
>
>
>
> Also, if I ssh to any node in my cluster, create an MPI nodes file by hand,
> and then use it to run like this:
>
>
>
> mpirun --machinefile ../nodefile ./mpi_hello_hostname
>
>
>
> I get back all of the expected output from all of the nodes listed in the
> manually created nodes file.
>
> --
>
> Steven DuChene
>
>
>
>
>
> --
>
>
>
> The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
>
> phone:+47 77 64 41 07, fax:+47 77 64 41 00
>
> Roy Dragseth, Team Leader, High Performance Computing
>
> Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no
>
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>


-- 
David Beer | Software Engineer
Adaptive Computing