[torqueusers] read of pipe for sid job error, more info

Glen Beane beaneg at umcs.maine.edu
Wed Sep 22 12:32:48 MDT 2004


Any news on this?

Glen

On Sep 21, 2004, at 7:12 PM, jacksond at supercluster.org wrote:

> Glen,
>
>   We are going to try to reproduce this in our lab and also look for
> potential uninitialized variable usage on the mom side with some memory
> analysis tools.  The problem may stem from the OS-specific mom modules
> located in the *bsd directory, which would explain why this does not
> affect Linux systems.
>
>   We should be able to get you some information by noon tomorrow.
>
>   Thanks for this piece of information.  It should be valuable.
>
> Dave
>
>  On Tue, 21 Sep 2004, Glen Beane wrote:
>
>>
>> Here is some more info, that will hopefully be helpful in tracking 
>> down the problem.
>>
>> This problem seems to stem from some kind of bizarre race condition
>> I'm seeing.  I don't know who it affects (all OS X, or maybe a
>> particular configuration?).
>>
>> What happens is, if the node is mother superior for its first job
>> after boot, then there are no problems.  If the node is a 'slave' node
>> for its first job, then that node will not work (creating the read of
>> pipe errors in its log file).
>>
>> If I boot the cluster, and then 'bless' each node by running a simple 
>> 1 node job on them, then this problem does not seem to appear.
>>
>>>> On Mon, 2004-09-20 at 16:07, Glen Beane wrote:
>>>>> I got Invalid Argument (22) in start_process, read of pipe...got 0 not 8
>>>>> and
>>>>> Unknown error 0: (0) in start_process, read of pipe... got 0 not 8
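
A note on the two errno values above: read() returning 0 means end-of-file
(every write end of the pipe was closed before any data arrived). That is
not an error, so errno is only meaningful when read() returns -1. Seeing
both 22 and 0 logged for the same symptom would be consistent with the
logged value being left over from some earlier call. A minimal sketch in C,
not TORQUE code; read_from_closed_pipe is a made-up name used only for
illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of TORQUE: open a pipe, close the write
 * end without writing anything, then try to read `len` bytes from it.
 * Returns whatever read() returned, or -2 if the pipe could not be created.
 */
ssize_t read_from_closed_pipe(void *buf, size_t len)
{
    int fds[2];
    ssize_t n;

    assert(buf != NULL);

    if (pipe(fds) != 0)
        return -2;

    close(fds[1]);              /* no writer left: the reader will see EOF */

    n = read(fds[0], buf, len); /* 0 here means EOF, not failure */
    /* errno carries information only when n == -1 */

    close(fds[0]);
    return n;
}
```

Called with an 8-byte buffer, this returns 0, which is the same "got 0 not
8" pattern as in the log lines.
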
>>>>> On Mon, 2004-09-20 at 15:46, jacksond at supercluster.org wrote:
>>>>>> Glen,
>>>>>>    The logged error message should include the 'errno' value
>>>>>> associated with the read of the pipe.  This would definitely be
>>>>>> helpful to get us started.
>>>>>> Thanks,
>>>>>> Dave
>>>>>> On Mon, 20 Sep 2004, Glen Beane wrote:
>>>>>>> On my OS X cluster, I keep getting errors from pbs_mom in the
>>>>>>> form of "read of pipe for sid job xxx got 0 not 8".
>>>>>>> If I kill pbs_mom on the node with signal 15, then reboot the
>>>>>>> node, often the problem will seem to go away.  Just restarting
>>>>>>> pbs_mom never fixes the problem.
>>>>>>> This error is coming from the start_process function; this
>>>>>>> particular block of code starts around line number 2199:
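>>>>>>> /* i is the number of bytes read from the pipe */
>>>>>>> if (i != sizeof(sjr))
>>>>>>> {
>>>>>>>   /* sizeof yields a size_t, so cast it to match the %d conversion */
>>>>>>>   sprintf(log_buffer, "read of pipe for sid job %s got %d not %d",
>>>>>>>     pjob->ji_qs.ji_jobid,
>>>>>>>     i,
>>>>>>>     (int)sizeof(sjr));
>>>>>>>   log_err(j,id,log_buffer);
>>>>>>>   return(-1);
>>>>>>> }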
>>>>>>> Any help troubleshooting this problem would be greatly 
>>>>>>> appreciated.
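
For what it's worth, here is a rough sketch of how the reading side of such
a pipe handshake could tell three cases apart: a real read error, the
writer closing the pipe early, and a complete record. This is not the
pbs_mom code. struct start_reply, read_reply and the REPLY_* codes are
invented for illustration, and the 8-byte layout is only a guess based on
the "not 8" in the message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical 8-byte reply record; the real sjr layout may differ. */
struct start_reply {
    int code;
    int session_id;
};

enum { REPLY_OK = 0, REPLY_EOF = 1, REPLY_ERROR = -1 };

/*
 * Read exactly one struct start_reply from fd.
 * Retries on EINTR and keeps going after short reads.
 * *got receives the number of bytes actually read.
 */
int read_reply(int fd, struct start_reply *out, size_t *got)
{
    char *p = (char *)out;
    size_t total = 0;

    assert(out != NULL && got != NULL);

    while (total < sizeof(*out)) {
        ssize_t n = read(fd, p + total, sizeof(*out) - total);
        if (n == -1) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: try again */
            *got = total;
            return REPLY_ERROR;    /* errno is valid in this case */
        }
        if (n == 0) {
            *got = total;
            return REPLY_EOF;      /* writer closed the pipe early */
        }
        total += (size_t)n;
    }
    *got = total;
    return REPLY_OK;
}
```

With a split like this, the EOF case could be logged as "writer closed the
pipe before sending its reply" rather than alongside an errno value that
the read never set.
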
>>>>>>> _______________________________________________
>>>>>>> torqueusers mailing list
>>>>>>> torqueusers at supercluster.org
>>>>>>> http://supercluster.org/mailman/listinfo/torqueusers