[torqueusers] Torque - DIS problems

Piotr Siwczak psiwczak at man.poznan.pl
Fri Oct 27 04:51:28 MDT 2006


Hi,
We are running an SGI Altix 3700 machine with 128 CPUs and 256GB RAM. 
The machine originally had 64 CPUs, but we recently upgraded it to 128.
We also turned on fairshare some time ago.
After these two changes we started to have problems with torque/maui (we
run torque 2.0.0p10 and maui 3.2.6p13).
It looks like the torque server and mom fail to communicate with each
other. We keep getting messages about DIS errors:

In pbs_server:
(grep through logs for DIS):
10/27/2006 08:52:19;0080;PBS_Server;Req;req_reject;Reject reply code=15056(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=Connect, from @
10/27/2006 08:52:19;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
10/27/2006 10:16:50;0080;PBS_Server;Req;req_reject;Reject reply code=15056(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=Connect, from @
10/27/2006 10:16:50;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
10/27/2006 12:01:52;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
10/27/2006 12:01:52;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -10
10/27/2006 12:01:52;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
10/27/2006 12:16:15;0080;PBS_Server;Req;req_reject;Reject reply code=15056(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=Connect, from @
10/27/2006 12:16:15;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
10/27/2006 12:23:34;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
10/27/2006 12:23:34;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
10/27/2006 12:24:18;0080;PBS_Server;Req;req_reject;Reject reply code=15056(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=Connect, from @
10/27/2006 12:24:18;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
10/27/2006 12:29:33;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
10/27/2006 12:29:33;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
10/27/2006 12:29:33;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
10/27/2006 12:42:58;0080;PBS_Server;Req;req_reject;Reject reply code=15056(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=Connect, from @
10/27/2006 12:42:58;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
10/27/2006 12:45:09;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1


In mom:
10/27/2006 12:45:07;0002;   pbs_mom;Req;dis_reply_write;DIS reply failure, -1
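
To see how the DIS messages split between the two daemons and message
types, the excerpts can be tallied with a short script. This is only a
minimal sketch: the semicolon-separated layout (timestamp, event code,
daemon, class, function, message) is inferred from the lines quoted here,
and the names join_wrapped and tally_dis are made up for illustration.
It also rejoins entries that were wrapped across lines.

```python
import re
from collections import Counter

# Assumption: every log entry starts with "MM/DD/YYYY HH:MM:SS;" as in the
# excerpts above. Lines without that prefix are treated as wrapped text.
ENTRY_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")


def join_wrapped(lines):
    """Merge continuation lines into the entry that precedes them."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def tally_dis(lines):
    """Count entries whose message mentions DIS, per (daemon, function, message)."""
    counts = Counter()
    for entry in join_wrapped(lines):
        # Assumed fields: timestamp;event code;daemon;class;function;message
        fields = entry.split(";", 5)
        if len(fields) != 6 or "DIS" not in fields[5]:
            continue
        daemon = fields[2].strip()  # pbs_mom is padded with spaces in the log
        counts[(daemon, fields[4], fields[5])] += 1
    return counts
```

Passing an open log file (or a list of lines) to tally_dis returns a
Counter, so most_common() lists the most frequent message first. The
count for "DIS reply failure, -10" versus "-1" is then easy to compare.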

Can anyone help me with this?

-- 
Piotr Siwczak <psiwczak at man.poznan.pl>
System Administrator

Poznan Supercomputing and Networking Center
Supercomputing Department

(www.eu-egee.org <piotr.siwczak at cern.ch>)
-- 


