[torqueusers] PBS_Server Errors

Azher Mughal azher at hep.caltech.edu
Sun Aug 12 02:10:06 MDT 2012


Hi All,

I have user jobs in the queue, but they are not running even though the 
cluster is free. I am getting these errors in /var/log/messages:

Aug 11 20:47:30 omega PBS_Server: LOG_ERROR::wait_request, connection 10 
to host 0 has timed out after 900 seconds - closing stale connection
Aug 11 21:47:30 omega PBS_Server: LOG_ERROR::wait_request, connection 10 
to host 0 has timed out after 900 seconds - closing stale connection
Aug 11 23:47:30 omega PBS_Server: LOG_ERROR::wait_request, connection 10 
to host 0 has timed out after 900 seconds - closing stale connection
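
To see how often this message recurs, the log can be scanned with a short script. This is only a minimal Python sketch: the regular expression is modelled on the three entries quoted above, it assumes each entry is a single line in the real log file (they are wrapped in this mail), and the names find_stale_connections and SAMPLE_LOG are made up for illustration.

```python
import re

# Matches the pbs_server message quoted above. Assumes each entry sits on a
# single line in /var/log/messages (the mail wrapped them for display).
STALE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+PBS_Server:\s+"
    r"LOG_ERROR::wait_request, connection (?P<conn>\d+)\s+to host (?P<peer>\S+)\s+"
    r"has timed out after (?P<secs>\d+) seconds"
)


def find_stale_connections(lines):
    """Return one dict per log line that reports a stale connection."""
    hits = []
    for line in lines:
        match = STALE_RE.search(line)
        if match:
            hits.append(match.groupdict())
    return hits


# The three entries from this message, rejoined onto single lines.
SAMPLE_LOG = [
    "Aug 11 20:47:30 omega PBS_Server: LOG_ERROR::wait_request, connection 10 "
    "to host 0 has timed out after 900 seconds - closing stale connection",
    "Aug 11 21:47:30 omega PBS_Server: LOG_ERROR::wait_request, connection 10 "
    "to host 0 has timed out after 900 seconds - closing stale connection",
    "Aug 11 23:47:30 omega PBS_Server: LOG_ERROR::wait_request, connection 10 "
    "to host 0 has timed out after 900 seconds - closing stale connection",
]

if __name__ == "__main__":
    for hit in find_stale_connections(SAMPLE_LOG):
        print(hit["stamp"], "connection", hit["conn"],
              "timed out after", hit["secs"], "seconds")
```

On a live system the same function could be fed the lines of /var/log/messages instead of SAMPLE_LOG.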

Some command output is below. Any suggestions as to what could be wrong, 
or what else needs to be checked?

Thanks
-Azher



[root at omega torque]# qstat -q

server: omega

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
superb             --      --       --      --    0  19 53   E R
babar              --      --       --      --    0   0 --   E R
babar100           --   72:00:00    --      --    0   0 10   E R
minos              --      --       --      --    0 33004 --   E R
dque               --      --       --      --    0   0 --   E R
io                 --      --       --      --    0   0 10   E R
                                                ----- -----
                                                    0 33023
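
The per-queue counts above can also be totalled programmatically, which helps when watching whether anything moves from queued to running. The following is a rough Python sketch, not a TORQUE tool: it assumes the column layout shown in this output (ten whitespace-separated fields per queue row, with the State column "E R" splitting into two), and the names parse_qstat_q and SAMPLE_QSTAT are hypothetical.

```python
def parse_qstat_q(text):
    """Return {queue_name: (running, queued)} from `qstat -q` style output."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Queue rows have 10 whitespace-separated fields here, because the
        # State column ("E R") splits into two tokens. This is an assumption
        # based only on the output quoted in this message.
        if len(fields) == 10 and fields[5].isdigit() and fields[6].isdigit():
            counts[fields[0]] = (int(fields[5]), int(fields[6]))
    return counts


# Queue rows copied from the output above.
SAMPLE_QSTAT = """\
Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
superb             --      --       --      --    0  19 53   E R
babar              --      --       --      --    0   0 --   E R
babar100           --   72:00:00    --      --    0   0 10   E R
minos              --      --       --      --    0 33004 --   E R
dque               --      --       --      --    0   0 --   E R
io                 --      --       --      --    0   0 10   E R
                                                ----- -----
                                                    0 33023
"""

if __name__ == "__main__":
    queues = parse_qstat_q(SAMPLE_QSTAT)
    running = sum(run for run, _ in queues.values())
    queued = sum(que for _, que in queues.values())
    print("running:", running, "queued:", queued)
```

The header, separator, and totals lines are skipped because they do not have numeric values in the Run and Que positions.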

[root at omega torque]# showbf --loglevel=9
INFO:     LOGLEVEL set to 9
MUGetOpt(1,ArgV,C:D:F:hP:V?-:Aa:c:d:f:g:m:M:n:p:q:r:Su:vV,OptArg)
INFO:     flags loaded
INFO:     1 command line args remaining:  'showbf'
MSUConnect(S,FALSE,EMsg)
INFO:     trying to connect to 192.168.1.213 (Port: 40559)
INFO:     non-blocking mode established
MSUSelectWrite(3,30000000)
INFO:     successful connect to TCP server (sd: 3)
MCSendRequest(S)
MSUSendData(S,30000000,TRUE,FALSE)
MSecGetChecksum2(Buf1,27,Buf2,81,Checksum,[NONE],CSKey)
INFO:     header created '00000128
CK=522ae0966230ab95 TS=1344758465 AUTH=root DT='
INFO:     sending short packet '00000128
CK=522ae0966230ab95 TS=1344758465 AUTH=root DT=CMD=showbf AUTH=root 
ARG=root root ALL ALL 0 0 0 0 0 NC 0 0 [NONE] [NONE] [NONE]
'
MSUSendPacket(3,Buf,137,30000000,SC)
INFO:     sending packet '00000128
CK=522ae0966230ab95 TS=1344758465 AUTH=root DT=CMD=showbf AUTH=root 
ARG=root root ALL ALL 0 0 0 0 0 NC 0 0 [NONE] [NONE] [NONE]
'
MSUSelectWrite(3,30000000)
INFO:     packet sent (137 bytes of 137)
INFO:     message sent to server
INFO:     message sent: 'CMD=showbf AUTH=root ARG=root root ALL ALL 0 0 
0 0 0 NC 0 0 [NONE] [NONE] [NONE]
'
MSURecvData(S,30000000,TRUE,SC,EMsg)
MSURecvPacket(3,BufP,9,NULL,30000000,SC)
MSUSelectRead(3,30000000)
MSUSelectRead-select failed
WARNING:  cannot receive message within 30.000000 second timeout (aborting)
ALERT:    cannot determine packet size
ERROR:    lost connection to server
ERROR:    cannot request service (status)


[root at omega torque]# showstart  1584704.omega
ERROR:    lost connection to server
ERROR:    cannot request service (status)


Both daemons are running:
[root at omega torque]# /etc/init.d/pbs_server status
pbs_server (pid 5057) is running...

[root at omega torque]# /etc/init.d/maui status
maui (pid 6096) is running...




