[torqueusers] Re: Latest Torque snapshot breaks mpiexec ?

Chris Samuel csamuel at vpac.org
Tue Aug 17 17:12:02 MDT 2004


On Wed, 18 Aug 2004 02:08 am, Pete Wyckoff wrote:

> I don't know how torque broke this, but I can tell you what's going on.

There is a chance that it's a misstep on my part (see the end of the email) 
but I can check that fairly quickly.  In the meantime I've included the info 
that you were after from qstat.

> Mpiexec does the equivalent of "qstat -f $PBS_JOBID" for the Resource_List
> lines.  On good-old OpenPBS 2.3.12 + patches, we get:
>
>         Resource_List.neednodes = piv011:ppn=2+piv012:ppn=2
> 	Resource_List.nodect = 2
> 	Resource_List.nodes = 2:ppn=2
> 	Resource_List.walltime = 24:00:00

That's not what I'm getting back from Torque, I'm seeing:

Job Id: 53504.XXXX
    Job_Name = STDIN
    Job_Owner = samuelc at XXXX
    job_state = Q
    queue = dque
    server = mgt.XXXX
    Checkpoint = u
    ctime = Wed Aug 18 08:56:19 2004
    Error_Path = mgt.XXXX:/home/samuelc/STDIN.e53504
    exec_host = xnode61/1+xnode61/0+xnode60/1+xnode60/0+xnode59/1+xnode59/0+xno
        de58/1+xnode58/0+xnode57/1+xnode57/0+xnode56/1+xnode56/0+xnode55/1+xnod
        e55/0+xnode54/1+xnode54/0+xnode53/1+xnode53/0+xnode52/1+xnode52/0+xnode
        51/1+xnode51/0+xnode50/1+xnode50/0+xnode49/1+xnode49/0+xnode48/1+xnode4
        8/0+xnode47/1+xnode47/0+xnode46/1+xnode46/0+xnode45/1+xnode45/0+xnode44
        /1+xnode44/0+xnode43/1+xnode43/0+xnode42/1+xnode42/0+xnode41/1+xnode41/
        0+xnode40/1+xnode40/0+xnode39/1+xnode39/0+xnode38/1+xnode38/0+xnode37/1
        +xnode37/0+xnode36/1+xnode36/0+xnode35/1+xnode35/0+xnode34/1+xnode34/0+
        xnode33/1+xnode33/0+xnode32/1+xnode32/0+xnode31/1+xnode31/0+xnode30/1+x
        node30/0+xnode29/1+xnode29/0+xnode28/1+xnode28/0+xnode27/1+xnode27/0+xn
        ode26/1+xnode26/0+xnode25/1+xnode25/0+xnode24/1+xnode24/0+xnode23/1+xno
        de23/0+xnode22/1+xnode22/0+xnode21/1+xnode21/0+xnode20/1+xnode20/0+xnod
        e19/1+xnode19/0+xnode18/1+xnode18/0+xnode17/1+xnode17/0
    Hold_Types = n
    Join_Path = n
    Keep_Files = oe
    Mail_Points = a
    mtime = Wed Aug 18 09:01:52 2004
    Output_Path = mgt.XXXX:/home/samuelc/STDIN.o53504
    Priority = 0
    qtime = Wed Aug 18 08:56:19 2004
    Rerunable = True
    Resource_List.nodect = 45
    Resource_List.nodes = 45:ppn=2
    Variable_List = PBS_O_HOME=/home/samuelc,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=samuelc,
        PBS_O_PATH=/usr/kerberos/bin:/opt/csm/bin:/usr/local/bin:/bin:/usr/bin
        :/usr/X11R6/bin:/opt/IBMJava2-131/jre/bin/:/opt/csm/ect/bin:/opt/csm/ec
:/home/samuelc/bin,PBS_O_MAIL=/var/spool/mail/samuelc,
        PBS_O_SHELL=/bin/bash,PBS_O_HOST=mgt.XXXX,
        PBS_O_WORKDIR=/home/samuelc,PBS_O_QUEUE=dque
    etime = Wed Aug 18 08:56:19 2004
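
For anyone wanting to compare the two outputs mechanically, here is a minimal
Python sketch that pulls the Resource_List lines out of "qstat -f" text like
the above. It is not mpiexec's code (that is C, in get_hosts.c); the function
name resource_list is made up for illustration, and it assumes each
Resource_List value fits on one line, which may not hold for a long
neednodes list.

```python
def resource_list(qstat_text):
    """Collect Resource_List.* attributes from `qstat -f` output.

    Assumption: each Resource_List value sits on a single line
    (wrapped continuation lines are not rejoined here).
    """
    prefix = "Resource_List."
    resources = {}
    for line in qstat_text.splitlines():
        stripped = line.strip()
        # Attribute lines look like "name = value"; skip everything else.
        if stripped.startswith(prefix) and " = " in stripped:
            name, value = stripped.split(" = ", 1)
            resources[name[len(prefix):]] = value
    return resources


if __name__ == "__main__":
    sample = (
        "Job Id: 53504.XXXX\n"
        "    Rerunable = True\n"
        "    Resource_List.nodect = 45\n"
        "    Resource_List.nodes = 45:ppn=2\n"
    )
    print(resource_list(sample))
```

Run against the Torque output above, this yields only nodect and nodes; the
OpenPBS output Pete quoted also carries neednodes and walltime.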



> If this were a single SMP machine configured with the "ts" attribute in
> the PBS config file, you would see instead something like this:
>
> 	Resource_List.ncpus = 2
> 	Resource_List.walltime = 24:00:00

Nothing has the ts option set.
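
The two Resource_List shapes quoted so far (nodes/neednodes for a cluster,
ncpus alone for a time-shared SMP) could be told apart roughly as in the
Python sketch below. This is only a guess at the kind of check involved, not
what get_hosts.c actually does; looks_time_shared is a hypothetical name and
the rule it applies is an assumption.

```python
def looks_time_shared(resources):
    """Guess whether a job's Resource_List describes one time-shared SMP host.

    `resources` maps Resource_List names (without the prefix) to values.
    Assumed rule: ncpus present and no node specification at all.
    """
    has_nodes = "nodes" in resources or "neednodes" in resources
    return "ncpus" in resources and not has_nodes


if __name__ == "__main__":
    smp = {"ncpus": "2", "walltime": "24:00:00"}
    cluster = {"nodect": "45", "nodes": "45:ppn=2"}
    print(looks_time_shared(smp), looks_time_shared(cluster))
```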

> Detecting that difference lets mpiexec properly handle these machines.
>
> In a typical cluster you can comment out the block in get_hosts.c:235..261
> and probably proceed happily.  Any info you can give me about how torque
> is supplying this info would help me to hack around their oddities and
> avoid problems for future users.  Like the full output of qstat -f, e.g.

See above.

Hmm, one thought does cross my mind: I haven't recompiled MAUI on this cluster
against the latest Torque snapshot, so it may be a problem on my part.

I'll fix this and retry to see if this really is a problem.

Thanks Peter!

cheers!
Chris
-- 
 Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
