[torqueusers] pbs_mom dies on exit of interactive session

DuChene, StevenX A stevenx.a.duchene at intel.com
Mon Apr 30 14:24:58 MDT 2012


I added a unique feature to each of two nodes (a different feature on each):

qmgr -c "set node atom221 properties += yestest"
qmgr -c "set node atom231 properties += testdie"

Then as a regular user I submitted an interactive job like this:
qsub -I -l nodes=1:testdie+1:yestest

I got this back:

qsub: waiting for job 25.login2.sep.here to start
qsub: job 25.login2.sep.here ready
and then I got a shell on one of the nodes with the unique features.

I checked the contents of the $PBS_NODEFILE:

$ echo $PBS_NODEFILE
/var/spool/torque/aux//25.login2.sep.here

$ cat /var/spool/torque/aux//25.login2.sep.here
atom231
atom221

Using pbsdsh I tried to run a command on the other node I had been assigned:

$ pbsdsh -v -h atom221 ls /tmp
pbsdsh: rescinfo from 0: Linux atom231 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 x86_64:nodes=1:testdie+1:yestest,walltime=01:00:00
pbsdsh: error from tm_poll() 17002

If I immediately try the same command again I get:

$ pbsdsh -v -h atom221 ls /tmp
pbsdsh: tm_init failed, rc = TM_ESYSTEM (17000)
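
(I have not dug into this any further yet, but a reasonable next step is probably to check, while the interactive shell is still open, whether both moms still think they own job 25 and whether their logs show anything around the failed tm_request. The commands below are just the stock momctl utility and the default mom log location, run as root; the hostnames are the ones from this particular job:

# momctl -d 3 -h atom231        # mother superior for job 25
# momctl -d 3 -h atom221        # sister node pbsdsh was aimed at
# tail -50 /var/spool/torque/mom_logs/$(date +%Y%m%d)    # on each node
)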

The output from "print server" is:

# qmgr -c "p s"
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = login2
set server managers = maui at login2.sep.here
set server managers += root at login2.sep.here
set server operators = maui at login2.sep.here
set server operators += root at login2.sep.here
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 45
set server poll_jobs = True
set server mom_job_sync = True
set server keep_completed = 300
set server submit_hosts = elogin1
set server next_job_number = 26
set server moab_array_compatible = True

I don't have anything unusual in my maui.cfg file:

# cat /usr/local/maui/maui.cfg |grep -v ^$|grep -v ^#
SERVERHOST            login2
ADMIN1                maui root
RMCFG[ELOGIN2] TYPE=PBS
AMCFG[bank]  TYPE=NONE
RMPOLLINTERVAL        00:00:30
SERVERPORT            42559
SERVERMODE            NORMAL
LOGFILE               maui.log
LOGFILEMAXSIZE        10000000
LOGLEVEL              3
QUEUETIMEWEIGHT       1
BACKFILLPOLICY        FIRSTFIT
RESERVATIONPOLICY     CURRENTHIGHEST
NODEALLOCATIONPOLICY  MINRESOURCE
ENABLEMULTIREQJOBS      TRUE
ENABLEMULTINODEJOBS     TRUE
JOBNODEMATCHPOLICY EXACTNODE


From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Monday, April 30, 2012 10:57 AM
To: Torque Users Mailing List
Subject: Re: [torqueusers] pbs_mom dies on exit of interactive session

Steve,

The two things that would be most helpful are the text of any error messages you're seeing and the steps to reproduce.

Thanks,

David
On Mon, Apr 30, 2012 at 10:20 AM, DuChene, StevenX A <stevenx.a.duchene at intel.com> wrote:
If you can tell me what additional information you need I can try to obtain it.
--
Steven DuChene

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Monday, April 30, 2012 9:00 AM
To: Torque Users Mailing List
Subject: Re: [torqueusers] pbs_mom dies on exit of interactive session
Roy,

I'm sorry your message wasn't attended to - it's the nature of a mailing list that messages are sometimes lost. Can you (or Steve) describe the TM connect issues in a bit more detail? We should get a bug report filed for this.

David
On Sun, Apr 29, 2012 at 11:03 AM, Roy Dragseth <roy.dragseth at cc.uit.no> wrote:
I can confirm this too (reported the same problem to the mailing list a month ago, but nobody seemed to care). Torque 3.0.4 works fine with exactly the same system config.

r.


On Saturday 28. April 2012 03.21.05 DuChene, StevenX A wrote:
I am running torque-4.0.1, which I pulled from the svn 4.0.1 branch just today.
Earlier today I was running the 4.0-fixes tree from 04/03 and had the same results.
I was hoping the update to current sources would fix these problems, but no such luck.

If I run the following:

qsub -I -l nodes=7 -l arch=atomN570

from my pbs job submission host I get:

qsub: waiting for job 4.login2.sep.here to start
qsub: job 4.login2.sep.here ready

and then I get a shell prompt on node 0 of this job.

If I then do:

$ echo $PBS_NODEFILE
/var/spool/torque/aux//4.login2.sep.here

And then:

$ cat /var/spool/torque/aux//4.login2.sep.here
atom255
atom255
atom255
atom255
atom254
atom254
atom254

and then I try:

$ pbsdsh -h atom254 ls /tmp
pbsdsh: error from tm_poll() 17002

Alternatively if I use the -v option it says:

$ pbsdsh -v -h atom254 /bin/ls /tmp
pbsdsh: tm_init failed, rc = TM_ESYSTEM (17000)

Then, when I exit the shell, I get back to my job submission node, but if I check for the pbs_mom process on node 0 of the job I just had the shell prompt on, pbs_mom is no longer running and the status message from the init script says:

pbs_mom dead but subsys locked
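
(That status just means the pbs_mom process is gone while the init script's lock file is still in place. On an el6 setup like this one, recovery is roughly the following; the lock file path is the stock one used by the torque init script, and the reason for the crash should be in the mom log:

# ls -l /var/lock/subsys/pbs_mom       # stale lock left behind by the dead mom
# tail -100 /var/spool/torque/mom_logs/$(date +%Y%m%d)
# service pbs_mom start
)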

Also, MPI jobs run through torque that stay on the cpu cores of a single system seem to run just fine, but for MPI jobs that span separate nodes I get the following error each time:

$ cat script_viking_7nodes.pbs.e12
[viking12.sep.here:12360] plm:tm: failed to poll for a spawned daemon, return status = 17002
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
        viking11.sep.here - daemon did not report back when launched

The PBS job script for this MPI job is very simple:

#PBS -l nodes=7:Viking:ppn=1
#PBS -l arch=xeon

mpirun --machinefile $PBS_NODEFILE /home/sadX/mpi_test/mpi_hello_hostname > hello_out_viking7nodes

Also, if I ssh to any node in my cluster, create an MPI nodes file by hand, and then use it to run like this:

mpirun --machinefile ../nodefile ./mpi_hello_hostname

I get back all of the expected output from all of the nodes listed in the manually created nodes file.
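
(Since the hand-built ssh launch works, the failure looks confined to openmpi's TM launcher rather than to MPI itself. Two things probably worth checking, with the component names below being the usual openmpi ones rather than anything verified on this cluster: whether this openmpi build actually contains the tm components, and whether forcing the rsh/ssh launcher inside a torque job avoids the error:

$ ompi_info | grep -i tm
$ mpirun --mca plm rsh --machinefile $PBS_NODEFILE ./mpi_hello_hostname
)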
--
Steven DuChene


--

The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone: +47 77 64 41 07, fax: +47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no






--
David Beer | Software Engineer
Adaptive Computing




--
David Beer | Software Engineer
Adaptive Computing


