[torqueusers] Re: tm problem OSX PPC single node

Glen Beane glen.beane at gmail.com
Mon Feb 26 10:58:11 MST 2007


Are you _sure_ you are running 2.1.6 moms?

On 2/26/07, Brock Palen <brockp at umich.edu> wrote:
> >
> > Message: 2
> > Date: Fri, 23 Feb 2007 13:07:17 -0700
> > From: Garrick Staples <garrick at clusterresources.com>
> > Subject: Re: [torqueusers] tm problem OSX PPC single node
> > To: torqueusers at supercluster.org
> > Message-ID: <20070223200717.GM25271 at login>
> > Content-Type: text/plain; charset=us-ascii
> >
> > On Fri, Feb 23, 2007 at 02:30:04PM -0500, Brock Palen alleged:
> >> We have found a problem when trying to run jobs using tm when running
> >> on only one node, which is quite strange.  If the MPI library (LAM
> >> or OpenMPI) uses 2 nodes (nodes=2:ppn=2) the job will start just
> >> fine.  But if it's 1 (nodes=1:ppn=2) the job cannot start.  This is
> >> not a problem for serial jobs.  We are also using the same versions
> >> of torque and lam/openmpi on our linux cluster with no problems.  If
> >> I build a LAM without tm support the jobs run fine.
> >>
> >> I dug through the archives and found some references to a similar
> >> problem.  I'm just wondering what I should do to test it, or if this
> >> is a known problem on OSX?  The systems are running 10.3 on G5s, using
> >> torque-2.1.6.
> >
> > I'm not aware of any current OSX issues.
> >
> > It is easy to isolate this to TM with 'pbsdsh'.  Just do some tests
> > with something like 'pbsdsh hostname' with the different sized jobs.
> >
> > If this fails, then it is definitely a TM problem.  Otherwise it
> > should be punted to the OpenMPI peeps.
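
A minimal sketch of the test described above, assuming qsub is on the
PATH and accepts the job script on stdin. The function name tm_check and
the job name tmcheck are made up for illustration; the resource strings
are the two from this thread. By default it only prints what it would
submit; set RUN=1 to actually submit.

```shell
#!/bin/sh
# Print (or, with RUN=1, submit) one tiny pbsdsh test job per resource spec.
# tm_check and the job name 'tmcheck' are hypothetical names for this sketch.
tm_check() {
    for spec in "$@"; do
        if [ "${RUN:-0}" = 1 ]; then
            # Real submission: the job body is just the pbsdsh test.
            echo 'pbsdsh -v hostname' | qsub -l "nodes=$spec" -N tmcheck
        else
            # Dry run: show the command that would be executed.
            echo "echo 'pbsdsh -v hostname' | qsub -l nodes=$spec -N tmcheck"
        fi
    done
}

# Example: compare the failing and the working request sizes.
# tm_check 1:ppn=2 2:ppn=2
```

If the one-node job shows spawn errors in its output while the two-node
job prints hostnames, that points at TM rather than at the MPI library.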
>
>
> Yes, pbsdsh does fail when running on one node:
> nodes=1:ppn=2
>
> pbsdsh -v hostname
>
> the error output holds:
>
> pbsdsh: spawned task 0
> pbsdsh: spawned task 1
> pbsdsh: spawn event returned: 0 (2 spawns and 0 obits outstanding)
> pbsdsh: error 17000 on spawn
> pbsdsh: spawn event returned: 1 (1 spawns and 0 obits outstanding)
> pbsdsh: error 17000 on spawn
>
> And again it works just fine when requesting 2 nodes, or when running an
> interactive job.  So that sounds kinda strange.
>
> Also, I'm seeing the following in the pbs_mom logs.  I don't know if it's
> related, but I found it while trying to debug this problem:
>
> 02/26/2007 12:43:46;0001;   pbs_mom;Job;19359.aon.engin.umich.edu;task not started, 'hostname', stdio setup failed (see syslog)
> 02/26/2007 12:43:46;0001;   pbs_mom;Job;19359.aon.engin.umich.edu;task not started, 'hostname', stdio setup failed (see syslog)
> 02/26/2007 12:43:46;0008;   pbs_mom;Job;19359.aon.engin.umich.edu;ERROR:    received request 'SPAWN_TASK' from 141.212.28.178:1023 for job '19359.aon.engin.umich.edu' (cannot start task)
>
> And system.log has:
>
> Feb 26 12:43:46 aon155 pbs_mom: open_std_file, std file exists with the wrong group, someone is doing something fishy
> Feb 26 12:43:46 aon155 pbs_mom: Bad file descriptor (9) in open_std_out_err, Unable to open standard output/error
> Feb 26 12:43:46 aon155 pbs_mom: Bad file descriptor (9) in start_process, cannot open job stderr/stdout files
> Feb 26 12:43:46 aon155 pbs_mom: open_std_file, std file exists with the wrong group, someone is doing something fishy
> Feb 26 12:43:46 aon155 pbs_mom: open_std_file, std file exists with the wrong group, someone is doing something fishy
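
Since the syslog message complains about the group on an existing std
file, one way to look into it is to list files whose group differs from
the job owner's group. This is only a rough sketch: the function name
wrong_group_files is made up, and the spool path in the usage comment
assumes a default /var/spool/torque install, which may not match this
setup.

```shell
#!/bin/sh
# List regular files under DIR whose group is NOT the expected GROUP.
# GROUP may be a group name or a numeric gid.
# wrong_group_files is a hypothetical helper name for this sketch.
wrong_group_files() {
    dir=$1
    group=$2
    # '! -group' matches files that belong to any other group;
    # 'ls -ln' prints numeric uid/gid so mismatches are easy to spot.
    find "$dir" -type f ! -group "$group" -exec ls -ln {} \;
}

# Example (path is an assumption, adjust to your install):
# wrong_group_files /var/spool/torque/spool "$(id -gn "$USER")"
```

No output means every file found already has the expected group.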
>
> Thanks for any insight.  I have never seen this error before myself.
>
> Brock Palen
> Cac.engin.umich.edu
>

