[torqueusers] Re: Latest Torque snapshot breaks mpiexec ?

Pete Wyckoff pw at osc.edu
Tue Aug 17 10:08:33 MDT 2004


csamuel at vpac.org wrote on Mon, 16 Aug 2004 16:40 +1000:
> Just upgraded a cluster to the latest Torque snapshot 
> (torque-1.1.0p1-snap.1091461836) and found that this breaks mpiexec 0.76 
> (which has been completely rebuilt against the latest install).
> 
> I now get the error:
> 
> $ /usr/local/bin/mpiexec hostname
> mpiexec: Error: get_hosts: pbs_statjob returned neither "ncpus" nor "nodect".

I don't know how torque broke this, but I can tell you what's going on.
Mpiexec does the equivalent of "qstat -f $PBS_JOBID" and looks at the
Resource_List lines.  On good old OpenPBS 2.3.12 + patches, we get:

	Resource_List.neednodes = piv011:ppn=2+piv012:ppn=2
	Resource_List.nodect = 2
	Resource_List.nodes = 2:ppn=2
	Resource_List.walltime = 24:00:00

If this were a single SMP machine configured with the "ts" attribute in
the PBS config file, you would instead see something like this:

	Resource_List.ncpus = 2
	Resource_List.walltime = 24:00:00

Detecting that difference lets mpiexec properly handle these machines.
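
To make that check concrete, here is a rough sketch in Python.  It is not
mpiexec's code (mpiexec is C and calls pbs_statjob rather than parsing text);
the function names parse_resource_list and classify_job are made up for
illustration, and preferring "nodect" when both attributes are present is an
assumption.  It also ignores any line wrapping qstat may apply to long values.

```python
import re


def parse_resource_list(qstat_text):
    """Collect Resource_List.<name> = <value> pairs from qstat -f style text."""
    resources = {}
    for line in qstat_text.splitlines():
        match = re.match(r"\s*Resource_List\.(\w+)\s*=\s*(.*\S)", line)
        if match:
            resources[match.group(1)] = match.group(2)
    return resources


def classify_job(resources):
    """Return ("cluster", nodect) or ("smp", ncpus); fail if neither exists."""
    # Assumption: nodect wins if both attributes happen to be present.
    if "nodect" in resources:
        return ("cluster", int(resources["nodect"]))
    if "ncpus" in resources:
        return ("smp", int(resources["ncpus"]))
    raise ValueError('neither "ncpus" nor "nodect" found in Resource_List')
```

Feeding it the two listings above gives the "cluster" case for the first and
the "smp" case for the second; a job whose Resource_List has neither
attribute raises the error, which corresponds to the failure reported here.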

In a typical cluster you can comment out the block in get_hosts.c:235..261
and probably proceed happily.  Any info you can give me about how torque
is supplying this info would help me to hack around their oddities and
avoid problems for future users, e.g. the full output of qstat -f for a job.

		-- Pete

