[torqueusers] Re: tm problem OSX PPC single node

Brock Palen brockp at umich.edu
Mon Feb 26 10:53:48 MST 2007


>
> Message: 2
> Date: Fri, 23 Feb 2007 13:07:17 -0700
> From: Garrick Staples <garrick at clusterresources.com>
> Subject: Re: [torqueusers] tm problem OSX PPC single node
> To: torqueusers at supercluster.org
> Message-ID: <20070223200717.GM25271 at login>
> Content-Type: text/plain; charset=us-ascii
>
> On Fri, Feb 23, 2007 at 02:30:04PM -0500, Brock Palen alleged:
>> We have found a problem running jobs that use tm on only one node,
>> which is quite strange.  If the MPI library (LAM or OpenMPI) uses 2
>> nodes (nodes=2:ppn=2), the job starts just fine.  But with 1 node
>> (nodes=1:ppn=2), the job cannot start.  This is not a problem for
>> serial jobs.  We are also using the same versions of torque and
>> LAM/OpenMPI on our Linux cluster with no problems.  If I build a LAM
>> without tm support, the jobs run fine.
>>
>> I dug through the archives and found some references to a similar
>> problem.  I'm just wondering what I should do to test it, or whether
>> this is a known problem on OSX?  The systems are G5s running 10.3,
>> using torque-2.1.6.
>
> I'm not aware of any current OSX issues.
>
> It is easy to isolate this to TM with 'pbsdsh'.  Just do some tests
> with something like 'pbsdsh hostname' with the different sized jobs.
>
> If this fails, then it is definitely a TM problem.  Otherwise it
> should be punted to the OpenMPI peeps.


Yes, pbsdsh does fail when running on one node:
nodes=1:ppn=2

pbsdsh -v hostname

The error output is:

pbsdsh: spawned task 0
pbsdsh: spawned task 1
pbsdsh: spawn event returned: 0 (2 spawns and 0 obits outstanding)
pbsdsh: error 17000 on spawn
pbsdsh: spawn event returned: 1 (1 spawns and 0 obits outstanding)
pbsdsh: error 17000 on spawn
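
To compare runs across job sizes, the verbose output above can be tallied with a short script. This is only a sketch: the function name summarize_pbsdsh is hypothetical, and the two patterns are taken from the lines shown in this message, not from any pbsdsh documentation, so other versions may print different text.

```python
import re

# Patterns copied from the 'pbsdsh -v' output quoted in this message.
# Assumption: other pbsdsh versions may word these lines differently.
SPAWNED = re.compile(r"^pbsdsh: spawned task (\d+)$")
SPAWN_ERROR = re.compile(r"^pbsdsh: error (\d+) on spawn$")


def summarize_pbsdsh(output):
    """Collect spawned task ids and spawn error codes from pbsdsh -v output."""
    spawned = []
    errors = []
    for raw in output.splitlines():
        line = raw.strip()
        match = SPAWNED.match(line)
        if match:
            spawned.append(int(match.group(1)))
            continue
        match = SPAWN_ERROR.match(line)
        if match:
            errors.append(int(match.group(1)))
    return {"spawned": spawned, "errors": errors}


if __name__ == "__main__":
    import sys

    # Usage idea: pbsdsh -v hostname 2>&1 | python summarize.py
    print(summarize_pbsdsh(sys.stdin.read()))
```

Running it on the output of a nodes=1:ppn=2 job and a nodes=2:ppn=2 job makes it easy to see which case reports spawn errors.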

Again, it works just fine when requesting 2 nodes, or when running an
interactive job, which seems strange.

I'm also seeing the following in the pbs_mom logs.  I don't know if it
is related, but I found it while trying to debug this problem:

02/26/2007 12:43:46;0001;   pbs_mom;Job;19359.aon.engin.umich.edu;task not started, 'hostname', stdio setup failed (see syslog)
02/26/2007 12:43:46;0001;   pbs_mom;Job;19359.aon.engin.umich.edu;task not started, 'hostname', stdio setup failed (see syslog)
02/26/2007 12:43:46;0008;   pbs_mom;Job;19359.aon.engin.umich.edu;ERROR:    received request 'SPAWN_TASK' from 141.212.28.178:1023 for job '19359.aon.engin.umich.edu' (cannot start task)
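
When digging through a long pbs_mom log for one job, it can help to split each record on its semicolons. The following is a minimal sketch; the field names (timestamp, event, daemon, obj_class, obj_name, message) are my own labels for the six semicolon-separated parts visible in the lines above, not names taken from the Torque documentation.

```python
from typing import Iterable, List, NamedTuple


class MomLogRecord(NamedTuple):
    # Field names are illustrative labels, not official Torque terms.
    timestamp: str
    event: str
    daemon: str
    obj_class: str
    obj_name: str
    message: str


def parse_mom_log_line(line: str) -> MomLogRecord:
    """Split one log record into six fields; the message keeps any ';'."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        raise ValueError("not a six-field log record: %r" % line)
    return MomLogRecord(*(part.strip() for part in parts))


def records_for_job(lines: Iterable[str], job_id: str) -> List[MomLogRecord]:
    """Return the parsed records whose object name equals job_id."""
    found = []
    for line in lines:
        try:
            record = parse_mom_log_line(line)
        except ValueError:
            continue  # skip wrapped or otherwise malformed lines
        if record.obj_name == job_id:
            found.append(record)
    return found
```

Note that this expects each record on a single line; records that a mail client has wrapped need to be rejoined first.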

And system.log has:

Feb 26 12:43:46 aon155 pbs_mom: open_std_file, std file exists with the wrong group, someone is doing something fishy
Feb 26 12:43:46 aon155 pbs_mom: Bad file descriptor (9) in open_std_out_err, Unable to open standard output/error
Feb 26 12:43:46 aon155 pbs_mom: Bad file descriptor (9) in start_process, cannot open job stderr/stdout files
Feb 26 12:43:46 aon155 pbs_mom: open_std_file, std file exists with the wrong group, someone is doing something fishy
Feb 26 12:43:46 aon155 pbs_mom: open_std_file, std file exists with the wrong group, someone is doing something fishy
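
Since the syslog message complains about the group of an existing std file, one way to look into it is to compare the group on the job's stdout/stderr files with the group the job is expected to run as. The sketch below assumes the check amounts to a simple gid comparison; I have not confirmed exactly what pbs_mom compares, and the helper names are hypothetical. The location of the files depends on the installation, so the path is left as a command-line argument.

```python
import grp
import os
import sys


def group_name(gid):
    """Return the group name for gid, or the number as text if unknown."""
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        return str(gid)


def file_group_matches(path, expected_gid):
    """True if the file's group id equals expected_gid."""
    return os.stat(path).st_gid == expected_gid


def describe_group(path, expected_gid):
    """One-line report of the file's group versus the expected group."""
    actual = os.stat(path).st_gid
    status = "ok" if actual == expected_gid else "MISMATCH"
    return "%s: group %s (expected %s): %s" % (
        path, group_name(actual), group_name(expected_gid), status)


if __name__ == "__main__":
    # Assumption: run as the job owner, so the expected group is taken
    # to be that user's current effective gid.
    for target in sys.argv[1:]:
        print(describe_group(target, os.getegid()))
```

A MISMATCH on the job's output files would be consistent with the "wrong group" message, though it would not by itself explain why only the single-node case is affected.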

Thanks for any insight; I have never seen this error before.

Brock Palen
Cac.engin.umich.edu



