[torqueusers] Question about the difference between a node where pbs_server is run and a compute node

Bas van der Vlies basv at sara.nl
Wed Apr 28 11:01:34 MDT 2010


On 28 apr 2010, at 17:05, Ken Nielson wrote:

> On 04/28/2010 03:56 AM, Bas van der Vlies wrote:
>> Hello,
>> 
>>   We just installed version 2.4.7 and are experiencing some serious problems
>> with executing programs on the server. I noticed that the server
>> is using '/tmp/.torque-unix' and the clients 'pbs_iff'.
>> 
>> The following test on the pbs_server node will completely hang pbs_server.
>> Here is some pseudo code:
>>    p = pbs_connect( pbs_default() )
>> 
>> After this we cannot do anything on any compute node or on the server:
>>     - qstat, qsub, .....
>> 
>> On a compute node this is no problem at all. So I expect that /tmp/.torque-unix
>> is causing the problem.
>> 
>> Is this a known problem or a bug?
>> 
>> Regards
>> 
>> 
>> 
> There are several things we need to look at. The first one Garrick 
> already addressed. Is the MOM running on the same node as the pbs_server?
> 
> If not, is there evidence in the log files that the server and the 
> client are communicating?
> 



This happens only on the node that runs the pbs_server process; no pbs_mom is running on that node.  If I execute the above program on the pbs_server node, everything freezes.  On a compute node, where only pbs_mom is running, there is no problem at all.  I ran strace in both cases, and that is how I saw this difference.

It hangs in a poll call and, if I remember correctly, we get an ASN protocol error after a few seconds.  With version 2.1.11 we do not encounter this problem.
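
For anyone who wants to look at the socket side without going through pbs_connect, here is a rough sketch of a probe that could be run on the pbs_server node. It only checks whether the path from the report ('/tmp/.torque-unix') exists, is a socket, and accepts a connection within a timeout. It sends no TORQUE protocol data, and the function name, the status strings and the 5 second timeout are my own assumptions, not anything taken from TORQUE. It has not been tried against a live pbs_server, so treat it as a starting point only.

```python
import os
import socket
import stat

# Path taken from the report above; adjust if your build uses another one.
TORQUE_UNIX_SOCKET = "/tmp/.torque-unix"


def probe_unix_socket(path=TORQUE_UNIX_SOCKET, timeout=5.0):
    """Return a short status string for the Unix domain socket at `path`.

    Hypothetical helper: it opens and immediately closes a connection,
    with a timeout so the probe itself cannot block forever.
    """
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    except OSError as exc:
        return "error: %s" % exc.strerror

    if not stat.S_ISSOCK(mode):
        return "not-a-socket"

    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return "error: %s" % exc.strerror
    finally:
        sock.close()
    return "connected"


if __name__ == "__main__":
    print(probe_unix_socket())
```

A "timeout" or "error" result on the server node, compared with the compute node where the path is presumably absent, would at least narrow down whether the Unix socket path is involved.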

Regards
--
Bas van der Vlies
basv at sara.nl




