[torqueusers] Re: Latest Torque snapshot breaks mpiexec ?

Chris Samuel csamuel at vpac.org
Tue Aug 17 17:12:02 MDT 2004

On Wed, 18 Aug 2004 02:08 am, Pete Wyckoff wrote:

> I don't know how torque broke this, but I can tell you what's going on.

There is a chance that it's a misstep on my part (see the end of the email) 
but I can check that fairly quickly.  In the meantime I've included the info 
that you were after from qstat.

> Mpiexec does the equivalent of "qstat -f $PBS_JOBID" for the Resource_List
> lines.  On good-old OpenPBS 2.3.12 + patches, we get:
> 	Resource_List.neednodes = piv011:ppn=2+piv012:ppn=2
> 	Resource_List.nodect = 2
> 	Resource_List.nodes = 2:ppn=2
> 	Resource_List.walltime = 24:00:00

That's not what I'm getting back from Torque, I'm seeing:

Job Id: 53504.XXXX
    Job_Name = STDIN
    Job_Owner = samuelc at XXXX
    job_state = Q
    queue = dque
    server = mgt.XXXX
    Checkpoint = u
    ctime = Wed Aug 18 08:56:19 2004
    Error_Path = mgt.XXXX:/home/samuelc/STDIN.e53504
    exec_host = 
    Hold_Types = n
    Join_Path = n
    Keep_Files = oe
    Mail_Points = a
    mtime = Wed Aug 18 09:01:52 2004
    Output_Path = mgt.XXXX:/home/samuelc/STDIN.o53504
    Priority = 0
    qtime = Wed Aug 18 08:56:19 2004
    Rerunable = True
    Resource_List.nodect = 45
    Resource_List.nodes = 45:ppn=2
    Variable_List = PBS_O_HOME=/home/samuelc,PBS_O_LANG=en_US.UTF-8,
    etime = Wed Aug 18 08:56:19 2004

> If this were a single SMP machine configured with the "ts" attribute in
> the PBS config file you would see instead something like this:
> 	Resource_List.ncpus = 2
> 	Resource_List.walltime = 24:00:00

Nothing has the ts option set.
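
To make the comparison concrete, here is a minimal Python sketch of the kind of check being described: pull the Resource_List lines out of "qstat -f" output and see which of the three shapes they match. This is not mpiexec's code (that is C, in get_hosts.c); the function names and the three labels are made up for illustration, and continuation lines in the qstat output are ignored.

```python
import re

# Matches lines such as "    Resource_List.nodes = 45:ppn=2".
_RESOURCE_RE = re.compile(r"^\s*Resource_List\.(\w+)\s*=\s*(.*)$")


def parse_resource_list(qstat_text):
    """Return the Resource_List.* attributes found in `qstat -f` text.

    Hypothetical helper: wrapped continuation lines are not handled.
    """
    resources = {}
    for line in qstat_text.splitlines():
        match = _RESOURCE_RE.match(line)
        if match:
            resources[match.group(1)] = match.group(2).strip()
    return resources


def classify(resources):
    """Label the shape of a Resource_List dict (labels are illustrative).

    "neednodes": a node list is present, as in the OpenPBS example.
    "ncpus":     only a CPU count, as in the "ts" SMP example.
    "neither":   anything else, e.g. nodect and nodes with no neednodes.
    """
    if "neednodes" in resources:
        return "neednodes"
    if "ncpus" in resources and "nodes" not in resources:
        return "ncpus"
    return "neither"
```

Fed the two Resource_List lines from the Torque output above, parse_resource_list() finds only nodect and nodes, so classify() returns "neither": the job matches neither the OpenPBS shape nor the "ts" shape.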

> Detecting that difference lets mpiexec properly handle these machines.
> In a typical cluster you can comment out the block in get_hosts.c:235..261
> and probably proceed happily.  Any info you can give me about how torque
> is supplying this info would help me to hack around their oddities and
> avoid problems for future users.  Like the full output of qstat -f, e.g.

See above.

Hmm, one thought does cross my mind: I haven't recompiled MAUI on this cluster 
against the latest Torque snapshot, so the problem may be on my side.

I'll fix this and retry to see if this really is a problem.

Thanks Peter!

 Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
