[torqueusers] Re: tm problem OSX PPC single node
Brock Palen
brockp at umich.edu
Mon Feb 26 10:53:48 MST 2007
>
> Message: 2
> Date: Fri, 23 Feb 2007 13:07:17 -0700
> From: Garrick Staples <garrick at clusterresources.com>
> Subject: Re: [torqueusers] tm problem OSX PPC single node
> To: torqueusers at supercluster.org
> Message-ID: <20070223200717.GM25271 at login>
> Content-Type: text/plain; charset=us-ascii
>
> On Fri, Feb 23, 2007 at 02:30:04PM -0500, Brock Palen alleged:
>> We have found a problem when trying to run jobs using tm when running
>> on only one node, which is quite strange. If the MPI library (LAM
>> or OpenMPI) uses 2 nodes (nodes=2:ppn=2) the job will start just
>> fine. But if it's 1 (nodes=1:ppn=2) the job cannot start. This is
>> not a problem for serial jobs. We are also using the same versions
>> of torque and lam/openmpi on our linux cluster with no problems. If
>> I build a LAM without tm support the jobs run fine.
>>
>> I dug through the archives and I found some references to a similar
>> problem. I'm just wondering what I should do to test it, or if this
>> is a known problem on OSX? The systems are running 10.3 on G5s, using
>> torque-2.1.6.
>
> I'm not aware of any current OSX issues.
>
> It is easy to isolate this to TM with 'pbsdsh'. Just do some tests with
> something like 'pbsdsh hostname' with the different sized jobs.
>
> If this fails, then it is definitely a TM problem. Otherwise it should
> be punted to the OpenMPI peeps.
Yes, pbsdsh does fail. When running on one node:
nodes=1:ppn=2
pbsdsh -v hostname
the error output holds:
pbsdsh: spawned task 0
pbsdsh: spawned task 1
pbsdsh: spawn event returned: 0 (2 spawns and 0 obits outstanding)
pbsdsh: error 17000 on spawn
pbsdsh: spawn event returned: 1 (1 spawns and 0 obits outstanding)
pbsdsh: error 17000 on spawn
Again, it works just fine when requesting 2 nodes, or when running an
interactive job, which seems strange.
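For comparing runs across different job sizes, a small script can tally what pbsdsh reported. The following is only a minimal sketch: the function name summarize_pbsdsh and the regular expressions are my own, written against the exact lines quoted above, and other pbsdsh versions may print differently.

```python
import re

# Sample text copied from the 'pbsdsh -v hostname' output quoted above.
SAMPLE = """\
pbsdsh: spawned task 0
pbsdsh: spawned task 1
pbsdsh: spawn event returned: 0 (2 spawns and 0 obits outstanding)
pbsdsh: error 17000 on spawn
pbsdsh: spawn event returned: 1 (1 spawns and 0 obits outstanding)
pbsdsh: error 17000 on spawn
"""

# Patterns are assumptions based only on the lines seen in this thread.
SPAWNED = re.compile(r"^pbsdsh: spawned task (\d+)$")
ERROR = re.compile(r"^pbsdsh: error (\d+) on spawn$")


def summarize_pbsdsh(text):
    """Return (spawned task ids, spawn error codes) found in the output."""
    spawned, errors = [], []
    for line in text.splitlines():
        line = line.strip()
        match = SPAWNED.match(line)
        if match:
            spawned.append(int(match.group(1)))
            continue
        match = ERROR.match(line)
        if match:
            errors.append(int(match.group(1)))
    return spawned, errors


if __name__ == "__main__":
    spawned, errors = summarize_pbsdsh(SAMPLE)
    print(f"spawned tasks: {spawned}, spawn errors: {errors}")
```

Saving the stderr of each test job and feeding it to summarize_pbsdsh would show at a glance which job shapes produce spawn errors.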
Also, I'm seeing the following in the pbs_mom logs. I don't know if it's
related, but I found it while trying to debug this problem:
02/26/2007 12:43:46;0001; pbs_mom;Job;19359.aon.engin.umich.edu;task not started, 'hostname', stdio setup failed (see syslog)
02/26/2007 12:43:46;0001; pbs_mom;Job;19359.aon.engin.umich.edu;task not started, 'hostname', stdio setup failed (see syslog)
02/26/2007 12:43:46;0008; pbs_mom;Job;19359.aon.engin.umich.edu;ERROR: received request 'SPAWN_TASK' from 141.212.28.178:1023 for job '19359.aon.engin.umich.edu' (cannot start task)
And system.log has:
Feb 26 12:43:46 aon155 pbs_mom: open_std_file, std file exists with the wrong group, someone is doing something fishy
Feb 26 12:43:46 aon155 pbs_mom: Bad file descriptor (9) in open_std_out_err, Unable to open standard output/error
Feb 26 12:43:46 aon155 pbs_mom: Bad file descriptor (9) in start_process, cannot open job stderr/stdout files
Feb 26 12:43:46 aon155 pbs_mom: open_std_file, std file exists with the wrong group, someone is doing something fishy
Feb 26 12:43:46 aon155 pbs_mom: open_std_file, std file exists with the wrong group, someone is doing something fishy
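The "wrong group" message suggests pbs_mom compares the group of an existing stdout/stderr file against a group it expects for the job. The exact check is not shown in this thread, so the following is only a rough sketch of that kind of comparison. The helper name group_matches is hypothetical, and the temporary file merely stands in for a job's output file.

```python
import os
import tempfile


def group_matches(path, expected_gid):
    """Return True if the file at `path` is owned by group `expected_gid`."""
    return os.stat(path).st_gid == expected_gid


if __name__ == "__main__":
    # A throwaway file stands in for a job's stdout file (an assumption;
    # the real spool path used by pbs_mom is not given in this thread).
    with tempfile.NamedTemporaryFile() as tmp:
        actual_gid = os.stat(tmp.name).st_gid
        print("file gid:", actual_gid)
        print("process egid:", os.getegid())
        print("matches process egid:", group_matches(tmp.name, os.getegid()))
```

Running something like this as the job owner, against a file created in the same directory where the job's output files live, would show whether new files there end up with a group other than the user's own.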
Thanks for any insight. I have never seen this error before myself.
Brock Palen
Cac.engin.umich.edu