[torqueusers] pbs_mom dies on exit of interactive session

Michel Jouvin jouvin at lal.in2p3.fr
Mon Apr 30 14:55:51 MDT 2012


This makes me realize that the call for the skips fell by the wayside
today. My muezzin didn't go off... I'm tying a few knots in my
handkerchief for Wednesday morning...

Michel

--On Monday 30 April 2012 22:39 +0200 Roy Dragseth <roy.dragseth at cc.uit.no>
wrote:

> No problem. I should probably have used the bugzilla for this kind of
> thing anyway.
>
> I'll try to scramble together a more thorough bug report on this matter.
>
> BTW, is 4.0.X supposed to be config compatible with 3.0.X? That is, if
> 3.0.X works fine, can I assume that the same config will work on 4.0.X?
> Would any differences in behaviour indicate that a bug is present? The
> reason I'm asking is that I have a fairly solid setup for v3.0.4 in my
> torque-roll for Rocks, but I'm having a hard time getting 4.0.X working
> in the same environment. Being pressed for time, I would like a hint on
> where to look for problems.
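>
> For illustration, the config I'm talking about is nothing exotic; a
> minimal mom_priv/config along these lines is what I would hope to carry
> over unchanged (the server name below is just a placeholder):
>
>   $pbsserver  frontend.local
>   $logevent   255
>   $usecp      *:/home /home
>
> plus the usual server-side nodes file and queue definitions.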
>
> Regards,
> r.
>
>
> On Monday 30. April 2012 09.59.39 David Beer wrote:
>
> Roy,
>
>
> I'm sorry your message wasn't attended to - it's the nature of a mailing
> list that sometimes messages are lost. Can you (or Steve) describe the
> TM connect issues in a bit more detail? We should get a bug report filed
> for this.
>
>
> David
>
>
> On Sun, Apr 29, 2012 at 11:03 AM, Roy Dragseth <roy.dragseth at cc.uit.no>
> wrote:
>
> I can confirm this too (I reported the same problem to the mailing list a
> month ago, but nobody seemed to care). Torque 3.0.4 works fine with
> exactly the same system config.
>
> r.
>
>
> On Saturday 28. April 2012 03.21.05 DuChene, StevenX A wrote:
>
> I am running torque-4.0.1 that I pulled from the svn 4.0.1 branch just
> today. Earlier today I was running the 4.0-fixes tree from 04/03 and I had
> the same results.
> I was hoping the update to current sources would fix these problems, but
> no such luck.
>
> If I run the following:
>
> qsub -I -l nodes=7 -l arch=atomN570
>
> from my pbs job submission host I get:
>
> qsub: waiting for job 4.login2.sep.here to start
> qsub: job 4.login2.sep.here ready
>
> and then I get a shell prompt on node 0 of this job.
>
> If I then do:
>
> $ echo $PBS_NODEFILE
> /var/spool/torque/aux//4.login2.sep.here
>
> And then:
>
> $ cat /var/spool/torque/aux//4.login2.sep.here
> atom255
> atom255
> atom255
> atom255
> atom254
> atom254
> atom254
>
> and then I try:
>
> $ pbsdsh -h atom254 ls /tmp
> pbsdsh: error from tm_poll() 17002
>
> Alternatively, if I use the -v option it says:
>
> $ pbsdsh -v -h atom254 /bin/ls /tmp
> pbsdsh: tm_init failed, rc = TM_ESYSTEM (17000)
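>
> (A quick way to check whether the sister MOM is reachable at all would be
> to query it directly with momctl, e.g.:
>
> $ momctl -d 3 -h atom254
>
> the hostname here being just the second node from the nodefile above.)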
>
> Then when I exit the shell I get back to my job submission node, but when
> I check for the pbs_mom process on node 0 of the job I just had the shell
> prompt on, pbs_mom is no longer running and the status message from the
> init script says:
>
> pbs_mom dead but subsys locked
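>
> ("dead but subsys locked" just means the daemon exited without the init
> script removing its lock file; on a stock Red Hat-style init the node can
> usually be brought back with something like
>
> # rm -f /var/lock/subsys/pbs_mom
> # service pbs_mom start
>
> though that of course says nothing about why the MOM died in the first
> place.)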
>
> Also, any MPI jobs I run through Torque that stay on the CPU cores of one
> system seem to run just fine, but for MPI jobs that span separate nodes I
> get the following error each time:
>
> $ cat script_viking_7nodes.pbs.e12
> [viking12.sep.here:12360] plm:tm: failed to poll for a spawned daemon,
> return  status = 17002
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
>         viking11.sep.here - daemon did not report back when launched
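>
> (As an experiment it might be worth forcing Open MPI's rsh/ssh launcher
> instead of the TM one for the same job, e.g.
>
> mpirun --mca plm rsh --machinefile $PBS_NODEFILE \
>     /home/sadX/mpi_test/mpi_hello_hostname
>
> if that works inside the job, it would point at the TM interface rather
> than at the MPI setup itself.)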
>
> The PBS job script for this MPI job is very simple:
>
> #PBS -l nodes=7:Viking:ppn=1
> #PBS -l arch=xeon
>
> mpirun --machinefile $PBS_NODEFILE \
>     /home/sadX/mpi_test/mpi_hello_hostname > hello_out_viking7nodes
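>
> (The script is submitted in the obvious way, e.g.
>
> $ qsub script_viking_7nodes.pbs
>
> and the error output above comes from the resulting .e<jobid> file.)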
>
> Also, if I ssh to any node in my cluster, create an MPI nodes file by
> hand, and then use that to run like this:
>
> mpirun --machinefile ../nodefile ./mpi_hello_hostname
>
> I get back all of the expected output from all of the nodes listed in the
> manually created nodes file.
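>
> (The hand-made nodes file is nothing more than one hostname per line,
> e.g. something like
>
> viking11.sep.here
> viking12.sep.here
>
> much like the entries Torque writes into $PBS_NODEFILE.)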
> --
> Steven DuChene
>
>
>
>
> --
>
> The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
> phone:+47 77 64 41 07, fax:+47 77 64 41 00
> Roy Dragseth, Team Leader, High Performance Computing
> Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>
>
>
>
>
> --
>
> David Beer | Software Engineer
> Adaptive Computing
>
>
>
>
> --
>
> The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
> phone:+47 77 64 41 07, fax:+47 77 64 41 00
> Roy Dragseth, Team Leader, High Performance Computing
> Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no



     *************************************************************
     * Michel Jouvin                 Email : jouvin at lal.in2p3.fr *
     * LAL / CNRS                    Tel : +33 1 64468932        *
     * B.P. 34                       Fax : +33 1 69079404        *
     * 91898 Orsay Cedex                                         *
     * France                                                    *
     *************************************************************



