[torqueusers] read of pipe for sid job error, more info

jacksond at supercluster.org jacksond at supercluster.org
Tue Sep 21 17:12:06 MDT 2004


Glen,

   We are going to try to reproduce this in our lab and also look for 
potential uninitialized variable usage on the mom side with some memory 
analysis tools.  The problem may stem from the OS-specific mom modules 
located in the *bsd directory, which would explain why this does not 
affect Linux systems.

   We should be able to get you some information by noon tomorrow.

   Thanks for this piece of information.  It should be valuable.
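
   For reference, 'got 0 not 8' in the messages quoted below means that 
read() on the pipe returned 0, which is end-of-file: no data was waiting 
and every write end had been closed.  read() does not set errno in that 
case, so the logged 'Invalid Argument (22)' and 'Unknown error 0' may just 
be stale values rather than the real cause.  Here is a minimal sketch in C, 
not the mom code itself, of reading a fixed-size record while keeping EOF, 
short reads, and real errors apart.  The names start_rec, read_record, and 
the demo_* helpers are made up for illustration, and the two-field 8-byte 
layout is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical stand-in for the 8-byte record the parent expects. */
struct start_rec {
    int32_t code;
    int32_t sid;
};
static_assert(sizeof(struct start_rec) == 8, "record must be 8 bytes");

enum rec_status { REC_OK, REC_EOF, REC_SHORT, REC_ERROR };

/* Read exactly len bytes from fd, retrying on EINTR and partial reads.
 * *got receives the byte count actually read.
 * *saved_errno is nonzero only when REC_ERROR is returned. */
enum rec_status read_record(int fd, void *buf, size_t len,
                            size_t *got, int *saved_errno)
{
    size_t total = 0;

    *saved_errno = 0;
    while (total < len) {
        ssize_t n = read(fd, (char *)buf + total, len - total);
        if (n > 0) {
            total += (size_t)n;
            continue;
        }
        if (n == 0)
            break;                /* EOF: errno is not meaningful here */
        if (errno == EINTR)
            continue;             /* interrupted by a signal, retry */
        *saved_errno = errno;     /* capture before any other call */
        *got = total;
        return REC_ERROR;
    }
    *got = total;
    if (total == len)
        return REC_OK;
    return total == 0 ? REC_EOF : REC_SHORT;
}

/* Writer closes without writing: the "got 0" case. */
enum rec_status demo_closed_writer(void)
{
    int fds[2], err;
    size_t got;
    struct start_rec rec;
    enum rec_status st;

    if (pipe(fds) != 0)
        return REC_ERROR;
    close(fds[1]);
    errno = EINVAL;               /* simulate a stale errno from earlier */
    st = read_record(fds[0], &rec, sizeof(rec), &got, &err);
    close(fds[0]);
    return (err == 0 && got == 0) ? st : REC_ERROR;
}

/* Writer sends one full record: the normal case. */
enum rec_status demo_full_record(void)
{
    int fds[2], err;
    size_t got;
    struct start_rec out = { 0, 1234 }, in = { -1, -1 };
    enum rec_status st;

    if (pipe(fds) != 0)
        return REC_ERROR;
    if (write(fds[1], &out, sizeof(out)) != (ssize_t)sizeof(out)) {
        close(fds[0]);
        close(fds[1]);
        return REC_ERROR;
    }
    close(fds[1]);
    st = read_record(fds[0], &in, sizeof(in), &got, &err);
    close(fds[0]);
    return (st == REC_OK && in.sid == out.sid) ? REC_OK : REC_ERROR;
}
```

   In this sketch demo_closed_writer() returns REC_EOF with the saved 
error left at 0 even though errno held EINVAL beforehand, which is the 
situation where logging errno alongside a zero-byte read would mislead.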

Dave

  On Tue, 21 Sep 2004, Glen Beane wrote:

>
> Here is some more info that will hopefully be helpful in tracking down the 
> problem.
>
> This problem seems to stem from some kind of bizarre race condition I'm 
> seeing.  I don't know who it affects (all OS X, or maybe a particular 
> configuration?).
>
> What happens is, if the node is mother superior for its first job after 
> boot, then there are no problems.  If the node is a 'slave' node for its 
> first job, then that node will not work (creating the read of pipe errors 
> in its log file).
>
> If I boot the cluster, and then 'bless' each node by running a simple 1 node 
> job on them, then this problem does not seem to appear.
>
>>> On Mon, 2004-09-20 at 16:07, Glen Beane wrote:
>>> 
>>>> I got Invalid Argument (22) in start_process, read of pipe...got 0 not 8
>>>> 
>>>> and
>>>> 
>>>> Unknown error 0: (0) in start_process, read of pipe... got 0 not 8
>>>> 
>>>> On Mon, 2004-09-20 at 15:46, jacksond at supercluster.org wrote:
>>>> 
>>>>> Glen,
>>>>> 
>>>>>    The logged error message should include the 'errno' value associated
>>>>> with the read of the pipe.  This would definitely be helpful to get us
>>>>> started.
>>>>> 
>>>>> Thanks,
>>>>> Dave
>>>>> 
>>>>> On Mon, 20 Sep 2004, Glen Beane wrote:
>>>>> 
>>>>>> On my OS X cluster, I keep getting errors from pbs_mom in the form of
>>>>>> "read of pipe for sid job xxx got 0 not 8".
>>>>>> 
>>>>>> If I kill pbs_mom on the node with signal 15, then reboot the node,
>>>>>> often the problem will seem to go away. Just restarting pbs_mom never
>>>>>> fixes the problem.
>>>>>> 
>>>>>> 
>>>>>> This error is coming from the start_process function; this particular
>>>>>> block of code starts around line number 2199:
>>>>>> 
>>>>>> if (i != sizeof(sjr))
>>>>>> {
>>>>>>  sprintf(log_buffer, "read of pipe for sid job %s got %d not %d",
>>>>>>    pjob->ji_qs.ji_jobid,
>>>>>>    i,
>>>>>>    sizeof(sjr));
>>>>>> 
>>>>>>  log_err(j,id,log_buffer);
>>>>>> 
>>>>>>  return(-1);
>>>>>> }
>>>>>> 
>>>>>> 
>>>>>> Any help troubleshooting this problem would be greatly appreciated.
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> torqueusers mailing list
>>>>>> torqueusers at supercluster.org
>>>>>> http://supercluster.org/mailman/listinfo/torqueusers
>>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>

