[torquedev] Problem with TM interface in Torque 2.1.0p0
Jeff Squyres (jsquyres)
jsquyres at cisco.com
Fri May 19 05:02:55 MDT 2006
> -----Original Message-----
> From: torquedev-bounces at supercluster.org
> [mailto:torquedev-bounces at supercluster.org] On Behalf Of
> garrick at speculation.org
> Sent: Thursday, May 18, 2006 6:38 PM
> To: torquedev at supercluster.org
> Subject: Re: [torquedev] Problem with TM interface in Torque 2.1.0p0
>
> Both mpiexec and pbsdsh use tm_nodeinfo(). If you can't do 'pbsdsh
> hostname' you know there is a problem.
aon062:~/tmp jsquyres$ echo $PBS_ENVIRONMENT $PBS_JOBID
PBS_INTERACTIVE 405.aon.[[[private]]]
aon062:~/tmp jsquyres$ pbs-config --version
2.1.0p0
aon062:~/tmp jsquyres$ pbsdsh hostname
pbsdsh: tm_nodeinfo failed, rc = TM_ESYSTEM (17000)
Doh. :(
> I tried your test program and it works fine for me. Nothing has changed
> in that code in a long time. I suspect the TM code is fine, but MOM
> thinks the job has 0 vnodes (which AFAIK is impossible.)
Ok. FWIW, this is an OSX cluster running on PPC hardware:
aon062:~/tmp jsquyres$ uname -a
Darwin aon062.[[[private]]] 7.9.0 Darwin Kernel Version 7.9.0: Wed Mar 30 20:11:17 PST 2005; root:xnu/xnu-517.12.7.obj~1/RELEASE_PPC Power Macintosh powerpc
This cluster has apparently been running prior versions of Torque for
some time, and a recent system maintenance window was a good
opportunity to upgrade to 2.1.0p0.
> Can you send a 'qstat -f $jobid' and 'momctl -d 4 -h $node' output?
> Both of those should be run as an admin user (root) from the pbs_server
> to ensure all info is reported.
I'm not root on this cluster, so I'll have to ask the owners to do it
(they might be on this list...?). I can run qstat on the head node (as
non-root) -- I don't know if this is useful:
aon:~ jsquyres$ qstat -f 405.aon.[[[private]]]
Job Id: 405.aon.[[[private]]]
Job_Name = STDIN
Job_Owner = jsquyres at aon.[[[private]]]
resources_used.cput = 00:00:00
resources_used.mem = 0kb
resources_used.vmem = 0kb
resources_used.walltime = 00:04:25
job_state = R
queue = short
server = aon.[[[private]]]
Checkpoint = u
ctime = Fri May 19 06:52:45 2006
Error_Path = /dev/ttyp0
exec_host = aon062/1+aon062/0
Hold_Types = n
interactive = True
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Fri May 19 06:53:10 2006
Output_Path = /dev/ttyp0
Priority = 0
qtime = Fri May 19 06:52:45 2006
Rerunable = False
Resource_List.nodect = 2
Resource_List.nodes = 2
Resource_List.walltime = 01:00:00
session_id = 29504
Variable_List = PBS_O_HOME=/home1/jsquyres,PBS_O_LOGNAME=jsquyres,
PBS_O_PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/usr/cac/lam/bin:/opt/ibmcmp/xlf/8.1/bin:/usr/local/radmind-1.3.1/bin:/usr/local/radmind-1.3.1/sbin:/Applications/Tivoli Storage Manager v5.2.3:/usr/local/maui/bin:/sw/bin:/sw/sbin:/usr/X11R6/bin:/home/software/torque-2.1.0p0/bin:/usr/local/maui-3.2.6p14-snap.1138394201/bin,
PBS_O_MAIL=/var/mail/jsquyres,PBS_O_SHELL=/bin/bash,
PBS_O_HOST=aon.[[[private]]],PBS_O_WORKDIR=/home1/jsquyres,
PBS_O_QUEUE=short
etime = Fri May 19 06:52:45 2006
momctl explicitly requires root, so I'll have to get back to you on that
one.
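
As a quick sanity check on what the server reports, the exec_host string
in the qstat output above can be split to count the vnodes assigned to the
job. This is only a minimal Python sketch, assuming the host/index entries
joined by '+' seen in that output; parse_exec_host is a hypothetical helper
name, not part of Torque.

```python
from collections import Counter


def parse_exec_host(exec_host: str) -> Counter:
    """Count vnodes per host in an exec_host string.

    Assumes the 'host/index+host/index' layout shown by 'qstat -f'.
    This helper is hypothetical and not part of Torque itself.
    """
    counts = Counter()
    for entry in exec_host.split("+"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate an empty string or a stray '+'
        host, _, _index = entry.partition("/")
        counts[host] += 1
    return counts


# Value copied from the 'exec_host' line of the qstat output.
vnodes = parse_exec_host("aon062/1+aon062/0")
print(dict(vnodes), "total:", sum(vnodes.values()))
```

For this job it reports two vnodes on aon062, which matches
Resource_List.nodect = 2, so at least the server side does not show 0
vnodes.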
--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems