[torqueusers] read of pipe for sid job error, more info
Glen Beane
beaneg at umcs.maine.edu
Wed Sep 22 12:32:48 MDT 2004
Any news on this?
Glen
On Sep 21, 2004, at 7:12 PM, jacksond at supercluster.org wrote:
> Glen,
>
> We are going to try to reproduce this in our lab and also look for
> potential uninitialized variable usage on the mom side with some memory
> analysis tools. The problem may stem from the OS-specific mom modules
> located in the *bsd directory, which would explain why this does not
> affect Linux systems.
>
> We should be able to get you some information by noon tomorrow.
>
> Thanks for this piece of information. It should be valuable.
>
> Dave
>
> On Tue, 21 Sep 2004, Glen Beane wrote:
>
>>
>> Here is some more info that will hopefully be helpful in tracking
>> down the problem.
>>
>> This problem seems to stem from some kind of bizarre race condition
>> I'm seeing. I don't know who it affects (all OS X, or maybe a
>> particular configuration?).
>>
>> What happens is this: if a node is mother superior for its first job
>> after boot, then there are no problems. If the node is a 'slave' node
>> for its first job, then that node will not work (it writes the read of
>> pipe errors to its log file).
>>
>> If I boot the cluster and then 'bless' each node by running a simple
>> 1-node job on it, then this problem does not seem to appear.
>>
>>>> On Mon, 2004-09-20 at 16:07, Glen Beane wrote:
>>>>> I got
>>>>> Invalid Argument (22) in start_process, read of pipe...got 0 not 8
>>>>> and
>>>>> Unknown error 0: (0) in start_process, read of pipe... got 0 not 8
>>>>> On Mon, 2004-09-20 at 15:46, jacksond at supercluster.org wrote:
>>>>>> Glen,
>>>>>> The logged error message should include the 'errno' value associated
>>>>>> with the read of the pipe. This would definitely be helpful to get us
>>>>>> started.
>>>>>> Thanks,
>>>>>> Dave
>>>>>> On Mon, 20 Sep 2004, Glen Beane wrote:
>>>>>>> On my OS X cluster, I keep getting errors from pbs_mom of the form
>>>>>>> "read of pipe for sid job xxx got 0 not 8".
>>>>>>> If I kill pbs_mom on the node with signal 15, then reboot the node,
>>>>>>> often the problem will seem to go away. Just restarting pbs_mom never
>>>>>>> fixes the problem.
>>>>>>> This error is coming from the start_process function. This particular
>>>>>>> block of code starts around line number 2199:
>>>>>>>   if (i != sizeof(sjr))
>>>>>>>     {
>>>>>>>     sprintf(log_buffer, "read of pipe for sid job %s got %d not %d",
>>>>>>>       pjob->ji_qs.ji_jobid,
>>>>>>>       i,
>>>>>>>       sizeof(sjr));
>>>>>>>     log_err(j,id,log_buffer);
>>>>>>>     return(-1);
>>>>>>>     }
>>>>>>> Any help troubleshooting this problem would be greatly appreciated.
>>>>>>> _______________________________________________
>>>>>>> torqueusers mailing list
>>>>>>> torqueusers at supercluster.org
>>>>>>> http://supercluster.org/mailman/listinfo/torqueusers
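A note on the "got 0 not 8" lines quoted above: on POSIX systems, read() returning 0 on a pipe means end-of-file, i.e. every write end was closed before any data arrived. It is not an error return, so errno is not meaningful at that point, which would be consistent with the two different errno values (22 and 0) reported for the same symptom. The following is only a minimal sketch of that behaviour and not the pbs_mom code: struct start_rec, read_from_child() and classify_read() are hypothetical names invented for the illustration, and the real layout of sjr is not shown in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the record the child sends to the parent. */
struct start_rec {
    int sid;
    int code;
};

enum read_status { READ_ERROR, READ_EOF, READ_SHORT, READ_OK };

/* Classify a read() return value. Only READ_ERROR makes errno meaningful. */
static enum read_status classify_read(ssize_t n, size_t want)
{
    if (n == -1)
        return READ_ERROR;  /* real failure: errno describes it */
    if (n == 0)
        return READ_EOF;    /* writer closed the pipe without sending data */
    if ((size_t)n < want)
        return READ_SHORT;  /* partial record */
    return READ_OK;
}

/*
 * Fork a child that either writes one record to a pipe or exits without
 * writing anything. Returns what read() returned in the parent (or -2 if
 * pipe()/fork() failed) and stores the errno seen right after read().
 */
static ssize_t read_from_child(int child_writes, int *err_after)
{
    int fds[2];
    struct start_rec rec;
    ssize_t n;
    pid_t pid;

    assert(err_after != NULL);

    if (pipe(fds) == -1)
        return -2;

    pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -2;
    }

    if (pid == 0) {
        /* Child: optionally send one record, then exit. */
        struct start_rec out = { (int)getpid(), 0 };
        close(fds[0]);
        if (child_writes && write(fds[1], &out, sizeof(out)) != (ssize_t)sizeof(out))
            _exit(1);
        _exit(0);           /* exiting closes the child's write end */
    }

    /* Parent: close our copy of the write end so EOF can be seen. */
    close(fds[1]);
    do {
        errno = EINVAL;     /* simulate a stale errno left by an earlier call */
        n = read(fds[0], &rec, sizeof(rec));
    } while (n == -1 && errno == EINTR);
    *err_after = errno;     /* capture before close() can change it */

    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n;
}
```

Calling read_from_child(0, &err) should return 0, and classify_read() reports READ_EOF for that value. On common implementations err still holds the stale EINVAL that was planted before the call, although POSIX leaves errno unspecified after a successful call. A log line built from errno at that point therefore says little about why the child went away; the child's exit status from waitpid() is usually the more useful thing to record.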