[torqueusers] Question about the difference between a node where pbs_server is run and a compute node
Bas van der Vlies
basv at sara.nl
Wed Apr 28 11:01:34 MDT 2010
On 28 apr 2010, at 17:05, Ken Nielson wrote:
> On 04/28/2010 03:56 AM, Bas van der Vlies wrote:
>> Hello,
>>
>> We just installed version 2.4.7 and are experiencing some serious problems
>> with executing programs on the server. I noticed that the server
>> is using '/tmp/.torque-unix' and the clients use 'pbs_iff'.
>>
>> The following test on the pbs_server node will completely hang pbs_server.
>> Here is some pseudocode:
>> p = pbs_connect( pbs_default() )
>>
>> After this we cannot do anything on any compute node or on the server:
>> - qstat, qsub, .....
>>
>> On a compute node this is no problem at all, so I expect that
>> /tmp/.torque-unix is causing the problem.
>>
>> Is this a known problem or a bug?
>>
>> Regards
>>
>>
>>
> There are several things we need to look at. The first one Garrick
> already addressed. Is the MOM running on the same node as the pbs_server?
>
> If not, is there evidence in the log files that the server and the
> client are communicating?
>
This happens only on the node that runs the pbs_server process; no pbs_mom is running on that node. If I execute the above program on the node that runs pbs_server, everything freezes. On a compute node, where only pbs_mom is running, there is no problem at all. I ran strace in both cases and saw this difference.
On the server node it hangs in a poll call and, if I remember correctly, we get an ASN protocol error after a few seconds. With version 2.1.11 we do not encounter this problem.
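To narrow down whether the local socket is involved, a small probe can help. The following is only a sketch using the Python standard library, not part of TORQUE: it assumes that '/tmp/.torque-unix' is a stream-type unix-domain socket, and the function name probe_unix_socket and its return strings are made up for illustration.

```python
import os
import socket
import stat


def probe_unix_socket(path="/tmp/.torque-unix", timeout=5.0):
    """Report whether a unix-domain socket exists and accepts a connection.

    Hypothetical helper: the default path comes from the report above, and
    the SOCK_STREAM socket type is an assumption.
    """
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    except OSError as exc:
        return "error: %s" % exc
    if not stat.S_ISSOCK(mode):
        return "not-a-socket"

    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)  # never block forever, unlike the hanging client
    try:
        sock.connect(path)
        return "connected"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return "error: %s" % exc
    finally:
        sock.close()


if __name__ == "__main__":
    print(probe_unix_socket())
```

Note that "connected" only shows that the socket exists and the kernel accepted the connection. A server that stops answering after the connect, as in the poll hang described above, would still need strace or the server logs to diagnose.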
Regards
--
Bas van der Vlies
basv at sara.nl