[torqueusers] Torque - DIS problems
Piotr Siwczak
psiwczak at man.poznan.pl
Fri Oct 27 04:51:28 MDT 2006
Hi,
We are running an SGI Altix 3700 machine with 128 CPUs and 256GB RAM.
The machine started with 64 CPUs, but we recently upgraded it to 128.
We also turned on fairshare some time ago.
After these two changes we started to have problems with Torque/Maui (we
run Torque 2.0.0p10 and Maui 3.2.6p13).
It looks like pbs_server and pbs_mom fail to communicate with each other.
We keep getting messages about DIS errors:
In the pbs_server log (grepped for DIS):
10/27/2006 08:52:19;0080;PBS_Server;Req;req_reject;Reject reply
code=15056(Bad DIS based Request Protocol MSG=cannot decode message),
aux=0, type=Connect, from @
10/27/2006 08:52:19;0002;PBS_Server;Req;dis_reply_write;DIS reply
failure, -1
10/27/2006 10:16:50;0080;PBS_Server;Req;req_reject;Reject reply
code=15056(Bad DIS based Request Protocol MSG=cannot decode message),
aux=0, type=Connect, from @
10/27/2006 10:16:50;0002;PBS_Server;Req;dis_reply_write;DIS reply
failure, -1
10/27/2006 12:01:52;0002;PBS_Server;Req;dis_reply_write;DIS reply
failure, -1
10/27/2006 12:01:52;0002;PBS_Server;Req;dis_reply_write;DIS reply
failure, -10
10/27/2006 12:01:52;0002;PBS_Server;Req;dis_reply_write;DIS reply
failure, -1
10/27/2006 12:16:15;0080;PBS_Server;Req;req_reject;Reject reply
code=15056(Bad DIS based Request Protocol MSG=cannot decode message),
aux=0, type=Connect, from @
10/27/2006 12:16:15;0002;PBS_Server;Req;dis_reply_write;DIS reply
failure, -1
10/27/2006 12:23:34;0002;PBS_Server;Req;dis_reply_write;DIS reply
failure, -1
10/27/2006 12:23:34;0002;PBS_Server;Req;dis_reply_write;DIS reply
failure, -1
10/27/2006 12:24:18;0080;PBS_Server;Req;req_reject;Reject reply
code=15056(Bad DIS based Request Protocol MSG=cannot decode message),
aux=0, type=Connect, from @
10/27/2006 12:24:18;0002;PBS_Server;Req;dis_reply_write;DIS reply
failure, -1
10/27/2006 12:29:33;0002;PBS_Server;Req;dis_reply_write;DIS reply
failure, -1
10/27/2006 12:29:33;0002;PBS_Server;Req;dis_reply_write;DIS reply
failure, -1
10/27/2006 12:29:33;0002;PBS_Server;Req;dis_reply_write;DIS reply
failure, -1
10/27/2006 12:42:58;0080;PBS_Server;Req;req_reject;Reject reply
code=15056(Bad DIS based Request Protocol MSG=cannot decode message),
aux=0, type=Connect, from @
10/27/2006 12:42:58;0002;PBS_Server;Req;dis_reply_write;DIS reply
failure, -1
10/27/2006 12:45:09;0002;PBS_Server;Req;dis_reply_write;DIS reply
failure, -1
In the pbs_mom log:
10/27/2006 12:45:07;0002; pbs_mom;Req;dis_reply_write;DIS reply
failure, -1
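To get an overview of how often each kind of DIS message shows up, entries like the ones above could be tallied with a short script. This is only a rough sketch: it assumes the semicolon-separated layout seen in the excerpts (timestamp;event code;daemon;class;function;message) and that each entry is on a single line in the real log file (the wrapping above comes from the mail client). The function name `tally_dis_entries` and the sample list are made up for illustration and are not part of Torque.

```python
from collections import Counter


def tally_dis_entries(lines):
    """Count log entries whose message mentions DIS, keyed by (daemon, function).

    Assumed field layout, taken from the excerpts in this post:
    timestamp;event code;daemon;class;function;message
    """
    counts = Counter()
    for line in lines:
        # Split into at most 6 fields so semicolons inside the message survive.
        parts = line.rstrip("\n").split(";", 5)
        if len(parts) < 6:
            continue  # not in the assumed format, skip it
        _timestamp, _code, daemon, _cls, func, message = parts
        if "DIS" in message:
            counts[(daemon.strip(), func)] += 1
    return counts


# Sample entries copied from the excerpts above, joined onto single lines.
sample = [
    "10/27/2006 08:52:19;0080;PBS_Server;Req;req_reject;Reject reply "
    "code=15056(Bad DIS based Request Protocol MSG=cannot decode message), "
    "aux=0, type=Connect, from @",
    "10/27/2006 08:52:19;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1",
    "10/27/2006 12:45:07;0002; pbs_mom;Req;dis_reply_write;DIS reply failure, -1",
]

counts = tally_dis_entries(sample)
for (daemon, func), n in sorted(counts.items()):
    print(daemon, func, n)
```

For a real log, the same function could be fed an open file object instead of the sample list; the path to the log file depends on the installation.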
Can anyone help me with this?
--
Piotr Siwczak <psiwczak at man.poznan.pl>
System Administrator
Poznan Supercomputing and Networking Center
Supercomputing Department
(www.eu-egee.org <piotr.siwczak at cern.ch>)