[torqueusers] Job stuck in queue forever

Dan Roberts Daniel.G.Roberts at sanofi-aventis.com
Thu Aug 24 06:45:49 MDT 2006


Hello All

On a fairly regular basis, my 60-compute-node Linux cluster has a
job or two that stays queued forever.
I generally notice this problem when I submit 30-40 jobs at
once, each requesting 1 node and 1 CPU.
Most of the jobs run just fine within a few minutes, but one or two always
"hang" and stay queued until I qrun them by hand.
Can anyone help me figure out why this consistently happens?
Every time it happens, I see this note in the TORQUE log right after
the troublesome job:

08/23/2006 12:31:25;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler - port 15004

Does the above mean my PBS_Server crashed for an instant?
If there was in fact a hiccup, why doesn't my job just get run
once the system comes back online?


An example case is shown below.

Thanks for any help!
Dan


Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
40003.tucslnxc1- TL059            nm31306                 0 Q ghts
[root at tucslnxc1-b log]# 



[root at tucslnxc1-b log]# checkjob -v 40003


checking job 40003 (RM job '40003.tu¨Ô')

State: Idle  EState: Deferred
Creds:  user:nm31306  group:compchem  class:ghts  qos:medium
WallTime: 00:00:00 of 99:23:59:59
SubmitTime: Wed Aug 23 12:31:25
  (Time Queued  Total: 17:04:10  Eligible: 00:00:00)

StartDate: -00:00:54  Thu Aug 24 05:34:41
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Exec:  ''  ExecSize: 0  ImageSize: 0
Dedicated Resources Per Task: PROCS: 1
NodeAccess: SHARED
NodeCount: 0


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 18
PartitionMask: [ALL]
SystemQueueTime: Thu Aug 24 05:34:40

Flags:       RESTARTABLE

job is deferred.  Reason:  RMFailure  (job cannot be started - cannot set hostlist)
Holds:    Defer  (hold reason:  RMFailure)
PE:  1.00  StartPriority:  1010
cannot select job 40003 for partition DEFAULT (job hold active)

I also see this:

[root at tucslnxc1-b log]# ls -l /var/spool/torque/server_priv/jobs/
total 8
-rw-------    1 root     root         2659 Aug 23 12:31 40003.tucsl.JB
-rw-------    1 root     root          196 Aug 23 12:31 40003.tucsl.SC



The TORQUE log shows nothing about this job beyond the following:

08/23/2006 12:31:25;0100;PBS_Server;Job;40003.tucslnxc1-b.tuc.pharma.aventis.com;enqueuing into ghts, state 1 hop 1
08/23/2006 12:31:25;0008;PBS_Server;Job;40003.tucslnxc1-b.tuc.pharma.aventis.com;Job Queued at request of nm31306 at tucslnxc1-b.tuc.pharma.aventis.com, owner = nm31306 at tucslnxc1-b.tuc.pharma.aventis.com, job name = TL059, queue = ghts
08/23/2006 12:31:25;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler - port 15004
08/23/2006 12:31:26;0001;PBS_Server;Svr;PBS_Server;is_request, bad attempt to connect from 192.168.1.5:1023 (address not trusted)
08/23/2006 12:31:35;0100;PBS_Server;Req;;Type AuthenticateUser request received from nm31306 at tucslnxc1-b.tuc.pharma.aventis.com, sock=10
08/23/2006 12:31:35;0100;PBS_Server;Req;;Type QueueJob request received from nm31306@tucslnxc1-b.tuc.pharma.aventis.com, sock=9
08/23/2006 12:31:35;0100;PBS_Server;Req;;Type JobScript request received from nm31306 at tucslnxc1-b.tuc.pharma.aventis.com, sock=9
08/23/2006 12:31:35;0100;PBS_Server;Req;;Type ReadyToCommit request received from nm31306 at tucslnxc1-b.tuc.pharma.aventis.com, sock=9
08/23/2006 12:31:35;0100;PBS_Server;Req;;Type Commit request received from nm31306 at tucslnxc1-b.tuc.pharma.aventis.com, sock=9
08/23/2006 12:31:35;0100;PBS_Server;Job;40004.tucslnxc1-b.tuc.pharma.aventis.com;enqueuing into ghts, state 1 hop 1
08/23/2006 12:31:35;0008;PBS_Server;Job;40004.tucslnxc1-b.tuc.pharma.aventis.com;Job Queued at request of nm31306 at tucslnxc1-b.tuc.pharma.aventis.com, owner = nm31306 at tucslnxc1-b.tuc.pharma.aventis.com, job name = TL026, queue = ghts
08/23/2006 12:31:35;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler - port 15004
08/23/2006 12:31:45;0100;PBS_Server;Req;;Type AuthenticateUser request received from nm31306 at tucslnxc1-b.tuc.pharma.aventis.com, sock=10
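
To check whether the scheduler connection failures line up with particular jobs, the server log can be scanned mechanically. The following is only a rough sketch: it assumes the six semicolon-separated fields visible in the lines quoted above (timestamp, event code, server name, record type, object, message) with each record on a single unwrapped line, and the function names are made up for illustration.

```python
from collections import defaultdict

SCHED_ERR = "Could not contact Scheduler"


def parse_line(line):
    """Split one server log record into its six semicolon-separated fields.

    Returns None for lines that do not match the assumed layout.
    """
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    timestamp, code, _server, kind, obj, message = parts
    return {"time": timestamp, "code": code, "kind": kind,
            "object": obj, "message": message}


def jobs_enqueued_during_sched_errors(lines):
    """Return job ids enqueued in the same second as a scheduler contact failure."""
    enqueued = defaultdict(list)   # timestamp -> job ids enqueued at that second
    error_times = set()            # timestamps with a scheduler contact failure
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        if rec["kind"] == "Job" and rec["message"].startswith("enqueuing into"):
            enqueued[rec["time"]].append(rec["object"])
        elif SCHED_ERR in rec["message"]:
            error_times.add(rec["time"])
    return sorted(job for t in error_times for job in enqueued.get(t, []))
```

Applied to the excerpt above, this would list both 40003 and 40004, since each enqueue shares a second with a refused connection, so it only narrows down candidates and does not by itself explain why one of them stays queued.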



In my maui.log file I see the start priority of the job increasing
forever. For example:
List,0)
08/24 05:26:41 INFO:     1 PBS jobs detected on RM tucslnxc1-b.tuc.pharma.aventis.com
08/24 05:26:41 INFO:     jobs detected: 1
08/24 05:26:41 MStatClearUsage(node,Active)
08/24 05:26:41 MClusterUpdateNodeState()
08/24 05:26:41 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg)
08/24 05:26:41 INFO:     job '40003' Priority:     1521
08/24 05:26:41 INFO:     Cred:   1000(00.0)  FS:      0(00.0)  Attr:      0(00.0)  Serv:    521(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:      0(00.0)
08/24 05:26:41 MStatClearUsage([NONE],Active)
08/24 05:26:41 MResDestroy(NULL)
08/24 05:26:41 INFO:     total jobs selected (ALL): 0/1 [EState: 1]
08/24 05:26:41 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg)
08/24 05:26:41 INFO:     job '40003' Priority:     1521
08/24 05:26:41 INFO:     Cred:   1000(00.0)  FS:      0(00.0)  Attr:      0(00.0)  Serv:    521(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:      0(00.0)
08/24 05:26:41 MStatClearUsage([NONE],Idle)
08/24 05:26:41 MResDestroy(NULL)
08/24 05:26:41 INFO:     total jobs selected (ALL): 0/1 [EState: 1]

<SNIP>


08/24 05:27:31 INFO:     job '40003' Priority:     1530
08/24 05:27:31 INFO:     Cred:   1000(00.0)  FS:      0(00.0)  Attr:      0(00.0)  Serv:    530(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:      0(00.0)
08/24 05:27:31 MStatClearUsage([NONE],Active)
08/24 05:27:31 MResDestroy(NULL)
08/24 05:27:31 INFO:     total jobs selected (ALL): 0/1 [EState: 1]
08/24 05:27:31 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg)
08/24 05:27:31 INFO:     job '40003' Priority:     1530
08/24 05:27:31 INFO:     Cred:   1000(00.0)  FS:      0(00.0)  Attr:      0(00.0)  Serv:    530(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:      0(00.0)
08/24 05:27:31 MStatClearUsage([NONE],Idle)
08/24 05:27:31 MResDestroy(NULL)
08/24 05:27:31 INFO:     total jobs selected (ALL): 0/1 [EState: 1]
08/24 05:27:31 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,

and so on.
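
For anyone who wants to follow the priority growth without paging through maui.log by hand, here is a small sketch that pulls out the `job '...' Priority:` records for one job. It assumes the line layout seen in the excerpt above (a `MM/DD HH:MM:SS` timestamp followed by `INFO:`), and the helper name `priority_history` is hypothetical.

```python
import re

# Matches lines such as:
#   08/24 05:26:41 INFO:     job '40003' Priority:     1521
PRIORITY_RE = re.compile(
    r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO:\s+job '([^']+)' Priority:\s+(-?\d+)"
)


def priority_history(lines, job_id):
    """Return (timestamp, priority) pairs for job_id, skipping consecutive repeats."""
    history = []
    for line in lines:
        match = PRIORITY_RE.match(line)
        if match and match.group(2) == job_id:
            entry = (match.group(1), int(match.group(3)))
            # The same record is logged for the HARD and SOFT passes,
            # so drop an entry identical to the one just recorded.
            if not history or history[-1] != entry:
                history.append(entry)
    return history
```

Calling it as `priority_history(open("maui.log"), "40003")` would give one entry per scheduling iteration in which the logged priority or timestamp changed.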


