[torqueusers] Re: tm problem OSX PPC single node

Brock Palen brockp at umich.edu
Mon Feb 26 11:12:34 MST 2007


Yes

on:/usr/local/torque-2.1.6/sbin root# ./momctl -h aon155 -d 3

Host: aon155.engin.umich.edu/aon155.engin.umich.edu   Version: 2.1.6

Brock Palen
Center for Advanced Computing
brockp at umich.edu
(734)936-1985


On Feb 26, 2007, at 12:58 PM, Glen Beane wrote:

> are you _sure_ you are running 2.1.6 moms?
>
> On 2/26/07, Brock Palen <brockp at umich.edu> wrote:
>> >
>> > Message: 2
>> > Date: Fri, 23 Feb 2007 13:07:17 -0700
>> > From: Garrick Staples <garrick at clusterresources.com>
>> > Subject: Re: [torqueusers] tm problem OSX PPC single node
>> > To: torqueusers at supercluster.org
>> > Message-ID: <20070223200717.GM25271 at login>
>> > Content-Type: text/plain; charset=us-ascii
>> >
>> > On Fri, Feb 23, 2007 at 02:30:04PM -0500, Brock Palen alleged:
>> >> We have found a problem when trying to run jobs using tm when
>> >> running on only one node, which is quite strange.  If the MPI
>> >> library (LAM or OpenMPI) uses 2 nodes (nodes=2:ppn=2), the job
>> >> will start just fine, but if it's 1 (nodes=1:ppn=2) the job
>> >> cannot start.  This is not a problem for serial jobs, and we are
>> >> also using the same versions of torque and LAM/OpenMPI on our
>> >> Linux cluster with no problems.  If I build a LAM without tm
>> >> support, the jobs run fine.
>> >>
>> >> I dug through the archives and found some references to a
>> >> similar problem.  I'm just wondering what I should do to test
>> >> it, or if this is a known problem on OSX?  The systems are
>> >> running 10.3 on G5s, using torque-2.1.6.
>> >
>> > I'm not aware of any current OSX issues.
>> >
>> > It is easy to isolate this to TM with 'pbsdsh'.  Just do some
>> > tests with something like 'pbsdsh hostname' with the different
>> > sized jobs.
>> >
>> > If this fails, then it is definitely a TM problem.  Otherwise it
>> > should be punted to the OpenMPI peeps.
>>
>>
>> Yes, pbsdsh does fail when running on one node
>> (nodes=1:ppn=2).
>>
>> pbsdsh -v hostname
>>
>> The error output contains:
>>
>> pbsdsh: spawned task 0
>> pbsdsh: spawned task 1
>> pbsdsh: spawn event returned: 0 (2 spawns and 0 obits outstanding)
>> pbsdsh: error 17000 on spawn
>> pbsdsh: spawn event returned: 1 (1 spawns and 0 obits outstanding)
>> pbsdsh: error 17000 on spawn
>>
>> And again it works just fine when requesting 2 nodes, or when
>> running an interactive job, so that seems kind of strange.
>>
>> Also, I'm seeing the following in the pbs_mom logs.  I don't know
>> if it's related, but I found it while trying to debug this problem:
>>
>> 02/26/2007 12:43:46;0001;   pbs_mom;Job;
>> 19359.aon.engin.umich.edu;task not started, 'hostname', stdio setup
>> failed (see syslog)
>> 02/26/2007 12:43:46;0001;   pbs_mom;Job;
>> 19359.aon.engin.umich.edu;task not started, 'hostname', stdio setup
>> failed (see syslog)
>> 02/26/2007 12:43:46;0008;   pbs_mom;Job;
>> 19359.aon.engin.umich.edu;ERROR:    received request 'SPAWN_TASK'
>> from 141.212.28.178:1023 for job '19359.aon.engin.umich.edu' (cannot
>> start task)
>>
>> And system.log has:
>>
>> Feb 26 12:43:46 aon155 pbs_mom: open_std_file, std file exists with
>> the wrong group, someone is doing something fishy
>> Feb 26 12:43:46 aon155 pbs_mom: Bad file descriptor (9) in
>> open_std_out_err, Unable to open standard output/error
>> Feb 26 12:43:46 aon155 pbs_mom: Bad file descriptor (9) in
>> start_process, cannot open job stderr/stdout files
>> Feb 26 12:43:46 aon155 pbs_mom: open_std_file, std file exists with
>> the wrong group, someone is doing something fishy
>> Feb 26 12:43:46 aon155 pbs_mom: open_std_file, std file exists with
>> the wrong group, someone is doing something fishy
>>
>> Thanks for any insight; I have never seen this error before myself.
>>
>> Brock Palen
>> Cac.engin.umich.edu
>>
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>
>
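The "std file exists with the wrong group" messages from open_std_file in the syslog above suggest pbs_mom is refusing to reuse a job's stdout/stderr spool file because its group ownership does not match the job owner.  A quick way to look for that by hand is to compare the group on each spool file against the job owner's primary group.  Below is a minimal diagnostic sketch of that comparison; the spool path and the exact check are assumptions for illustration, not TORQUE's actual code.

```shell
#!/bin/sh
# Hypothetical diagnostic sketch: list job spool files whose group does
# not match a given user's primary group.  SPOOL is an assumed default;
# override it to point at your mom's real spool directory.
SPOOL=${SPOOL:-/var/spool/torque/spool}

stale_group_files() {
    # $1 = job owner's username
    user=$1
    expected=$(id -gn "$user")            # user's primary group name
    for f in "$SPOOL"/*.OU "$SPOOL"/*.ER; do
        [ -e "$f" ] || continue
        # field 4 of 'ls -l' is the group (works on both OSX and Linux)
        actual=$(ls -ld "$f" | awk '{print $4}')
        [ "$actual" = "$expected" ] || \
            echo "$f: group $actual, expected $expected"
    done
}
```

Running something like this against the mom's spool directory while a nodes=1:ppn=2 job is failing would show whether leftover .OU/.ER files with a mismatched group are what trips open_std_file.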


