[torquedev] Problem with TM interface in Torque 2.1.0p0

Jeff Squyres (jsquyres) jsquyres at cisco.com
Fri May 19 05:02:55 MDT 2006


> -----Original Message-----
> From: torquedev-bounces at supercluster.org 
> [mailto:torquedev-bounces at supercluster.org] On Behalf Of 
> garrick at speculation.org
> Sent: Thursday, May 18, 2006 6:38 PM
> To: torquedev at supercluster.org
> Subject: Re: [torquedev] Problem with TM interface in Torque 2.1.0p0
> 
> Both mpiexec and pbsdsh use tm_nodeinfo().  If you can't do 'pbsdsh
> hostname' you know there is a problem.

aon062:~/tmp jsquyres$ echo $PBS_ENVIRONMENT $PBS_JOBID
PBS_INTERACTIVE 405.aon.[[[private]]]
aon062:~/tmp jsquyres$ pbs-config --version
2.1.0p0
aon062:~/tmp jsquyres$ pbsdsh hostname
pbsdsh: tm_nodeinfo failed, rc = TM_ESYSTEM (17000)

Doh.  :(

> I tried your test program and it works fine for me.  Nothing has changed
> in that code in a long time.  I suspect the TM code is fine, but MOM
> thinks the job has 0 vnodes (which AFAIK is impossible.)

Ok.  FWIW, this is an OSX cluster running on PPC hardware:

aon062:~/tmp jsquyres$ uname -a
Darwin aon062.[[[private]]] 7.9.0 Darwin Kernel Version 7.9.0: Wed Mar 30 20:11:17 PST 2005; root:xnu/xnu-517.12.7.obj~1/RELEASE_PPC  Power Macintosh powerpc

This cluster has apparently been running prior versions of Torque for
some time, and a recent system maintenance window was a good
opportunity to upgrade to 2.1.0p0.

> Can you send a 'qstat -f $jobid' and 'momctl -d 4 -h $node' output?
> Both of those should be run as an admin user (root) from the pbs_server
> to ensure all info is reported.

I'm not root on this cluster, so I'll have to ask the owners to do it
(they might be on this list...?).  I can run qstat on the head node (as
non-root) -- I don't know if this is useful:

aon:~ jsquyres$ qstat -f 405.aon.[[[private]]]
Job Id: 405.aon.[[[private]]]
    Job_Name = STDIN
    Job_Owner = jsquyres at aon.[[[private]]]
    resources_used.cput = 00:00:00
    resources_used.mem = 0kb
    resources_used.vmem = 0kb
    resources_used.walltime = 00:04:25
    job_state = R
    queue = short
    server = aon.[[[private]]]
    Checkpoint = u
    ctime = Fri May 19 06:52:45 2006
    Error_Path = /dev/ttyp0
    exec_host = aon062/1+aon062/0
    Hold_Types = n
    interactive = True
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Fri May 19 06:53:10 2006
    Output_Path = /dev/ttyp0
    Priority = 0
    qtime = Fri May 19 06:52:45 2006
    Rerunable = False
    Resource_List.nodect = 2
    Resource_List.nodes = 2
    Resource_List.walltime = 01:00:00
    session_id = 29504
    Variable_List = PBS_O_HOME=/home1/jsquyres,PBS_O_LOGNAME=jsquyres,
        PBS_O_PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/usr/cac/lam/bin:/opt/ibmcmp/xlf/8.1/bin:/usr/local/radmind-1.3.1/bin:/usr/local/radmind-1.3.1/sbin:/Applications/Tivoli Storage Manager v5.2.3:/usr/local/maui/bin:/sw/bin:/sw/sbin:/usr/X11R6/bin:/home/software/torque-2.1.0p0/bin:/usr/local/maui-3.2.6p14-snap.1138394201/bin,
        PBS_O_MAIL=/var/mail/jsquyres,PBS_O_SHELL=/bin/bash,
        PBS_O_HOST=aon.[[[private]]],PBS_O_WORKDIR=/home1/jsquyres,
        PBS_O_QUEUE=short
    etime = Fri May 19 06:52:45 2006

momctl explicitly requires root, so I'll have to get back to you on that
one.
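
FWIW, the exec_host value in the qstat output above can be checked
mechanically against Resource_List.nodect.  The following is only a
minimal sketch: parse_exec_host is a hypothetical helper (not part of
Torque), and it assumes exec_host is always a '+'-separated list of
host/index entries, as it appears in this job's output.

```python
def parse_exec_host(exec_host):
    """Split an exec_host string into (host, slot_index) pairs.

    Assumes the 'host/index+host/index' layout seen in the qstat -f
    output above; this helper is hypothetical, not a Torque API.
    """
    slots = []
    for entry in exec_host.split("+"):
        host, _, index = entry.partition("/")
        slots.append((host, int(index)))
    return slots


# Value copied from the qstat -f output for job 405.
slots = parse_exec_host("aon062/1+aon062/0")
nodect = 2  # Resource_List.nodect from the same output

print(len(slots), slots)
print("slot count matches nodect:", len(slots) == nodect)
```

For this job the count comes out as 2, matching nodect, so the server
side at least recorded two execution slots.  Whether MOM agrees is what
the momctl output should show.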

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems

