[torqueusers] PBS_Server Errors
Azher Mughal
azher at hep.caltech.edu
Sun Aug 12 02:10:06 MDT 2012
Hi All,
I have user jobs in the queue, but they are not running even though the
cluster is free. I am getting these errors in /var/log/messages:
Aug 11 20:47:30 omega PBS_Server: LOG_ERROR::wait_request, connection 10
to host 0 has timed out after 900 seconds - closing stale connection
Aug 11 21:47:30 omega PBS_Server: LOG_ERROR::wait_request, connection 10
to host 0 has timed out after 900 seconds - closing stale connection
Aug 11 23:47:30 omega PBS_Server: LOG_ERROR::wait_request, connection 10
to host 0 has timed out after 900 seconds - closing stale connection
Some outputs are below. Any suggestions on what could be wrong, or what
else needs to be checked?
Thanks
-Azher
[root@omega torque]# qstat -q
server: omega
Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
superb             --      --       --     --     0  19 53   E R
babar              --      --       --     --     0   0 --   E R
babar100           --   72:00:00    --     --     0   0 10   E R
minos              --      --       --     --     0 33004 --   E R
dque               --      --       --     --     0   0 --   E R
io                 --      --       --     --     0   0 10   E R
                                              ----- -----
                                                  0 33023
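
As a sanity check on the listing above, the per-queue Run and Que counts can be summed and compared with the totals line. This is only a sketch: the function name parse_queue_counts and the whitespace-splitting approach are assumptions made for illustration and are not part of TORQUE. The sample rows are copied from the qstat -q output in this post.

```python
# Rows copied from the `qstat -q` output above (whitespace collapsed).
SAMPLE_QSTAT = """\
Queue Memory CPU Time Walltime Node Run Que Lm State
---------------- ------ -------- -------- ---- --- --- -- -----
superb -- -- -- -- 0 19 53 E R
babar -- -- -- -- 0 0 -- E R
babar100 -- 72:00:00 -- -- 0 0 10 E R
minos -- -- -- -- 0 33004 -- E R
dque -- -- -- -- 0 0 -- E R
io -- -- -- -- 0 0 10 E R
"""


def parse_queue_counts(text):
    """Map queue name -> (running, queued) for each data row.

    Assumes a data row splits into exactly 10 whitespace-separated
    fields, with Run and Que as the 6th and 7th fields. Header and
    separator lines fail the digit check or the length check and
    are skipped.
    """
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 10 and parts[5].isdigit() and parts[6].isdigit():
            counts[parts[0]] = (int(parts[5]), int(parts[6]))
    return counts


if __name__ == "__main__":
    counts = parse_queue_counts(SAMPLE_QSTAT)
    total_run = sum(run for run, _ in counts.values())
    total_que = sum(que for _, que in counts.values())
    print(f"running={total_run} queued={total_que}")
```

With the rows shown here, the sums agree with the totals line: nothing is running and every job is sitting in the queued state, almost all of them in minos.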
[root@omega torque]# showbf --loglevel=9
INFO: LOGLEVEL set to 9
MUGetOpt(1,ArgV,C:D:F:hP:V?-:Aa:c:d:f:g:m:M:n:p:q:r:Su:vV,OptArg)
INFO: flags loaded
INFO: 1 command line args remaining: 'showbf'
MSUConnect(S,FALSE,EMsg)
INFO: trying to connect to 192.168.1.213 (Port: 40559)
INFO: non-blocking mode established
MSUSelectWrite(3,30000000)
INFO: successful connect to TCP server (sd: 3)
MCSendRequest(S)
MSUSendData(S,30000000,TRUE,FALSE)
MSecGetChecksum2(Buf1,27,Buf2,81,Checksum,[NONE],CSKey)
INFO: header created '00000128
CK=522ae0966230ab95 TS=1344758465 AUTH=root DT='
INFO: sending short packet '00000128
CK=522ae0966230ab95 TS=1344758465 AUTH=root DT=CMD=showbf AUTH=root
ARG=root root ALL ALL 0 0 0 0 0 NC 0 0 [NONE] [NONE] [NONE]
'
MSUSendPacket(3,Buf,137,30000000,SC)
INFO: sending packet '00000128
CK=522ae0966230ab95 TS=1344758465 AUTH=root DT=CMD=showbf AUTH=root
ARG=root root ALL ALL 0 0 0 0 0 NC 0 0 [NONE] [NONE] [NONE]
'
MSUSelectWrite(3,30000000)
INFO: packet sent (137 bytes of 137)
INFO: message sent to server
INFO: message sent: 'CMD=showbf AUTH=root ARG=root root ALL ALL 0 0
0 0 0 NC 0 0 [NONE] [NONE] [NONE]
'
MSURecvData(S,30000000,TRUE,SC,EMsg)
MSURecvPacket(3,BufP,9,NULL,30000000,SC)
MSUSelectRead(3,30000000)
MSUSelectRead-select failed
WARNING: cannot receive message within 30.000000 second timeout (aborting)
ALERT: cannot determine packet size
ERROR: lost connection to server
ERROR: cannot request service (status)
[root@omega torque]# showstart 1584704.omega
ERROR: lost connection to server
ERROR: cannot request service (status)
Both daemons are running:
[root@omega torque]# /etc/init.d/pbs_server status
pbs_server (pid 5057) is running...
[root@omega torque]# /etc/init.d/maui status
maui (pid 6096) is running...
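
To see how often the stale-connection error recurs and which connection it involves, the wait_request lines from /var/log/messages can be tallied with a short script. This is a rough sketch, not a TORQUE tool: the regular expression and the names stale_connections and summarize are assumptions written against the three log lines quoted in this post, with each wrapped line joined back into one.

```python
import re
from collections import Counter

# Log lines copied from the /var/log/messages excerpt in this post,
# with the mail-wrapped continuation joined onto the same line.
SAMPLE_LOG = """\
Aug 11 20:47:30 omega PBS_Server: LOG_ERROR::wait_request, connection 10 to host 0 has timed out after 900 seconds - closing stale connection
Aug 11 21:47:30 omega PBS_Server: LOG_ERROR::wait_request, connection 10 to host 0 has timed out after 900 seconds - closing stale connection
Aug 11 23:47:30 omega PBS_Server: LOG_ERROR::wait_request, connection 10 to host 0 has timed out after 900 seconds - closing stale connection
"""

# Pattern inferred from the lines above; other TORQUE versions may
# word this message differently.
STALE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+PBS_Server:\s+"
    r"LOG_ERROR::wait_request, connection (?P<conn>\d+) to host (?P<peer>\S+) "
    r"has timed out after (?P<secs>\d+) seconds"
)


def stale_connections(text):
    """Return one dict per wait_request timeout line found in text."""
    matches = (STALE_RE.match(line) for line in text.splitlines())
    return [m.groupdict() for m in matches if m]


def summarize(events):
    """Count timeouts per (connection number, peer host) pair."""
    return Counter((e["conn"], e["peer"]) for e in events)


if __name__ == "__main__":
    events = stale_connections(SAMPLE_LOG)
    for (conn, peer), n in summarize(events).items():
        print(f"connection {conn} to host {peer}: {n} timeouts")
```

On a live system the same functions could be fed the contents of /var/log/messages instead of SAMPLE_LOG.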