[torqueusers] Job stuck in queue forever
Dan Roberts
Daniel.G.Roberts at sanofi-aventis.com
Thu Aug 24 06:45:49 MDT 2006
Hello All
Fairly regularly, on my 60-compute-node Linux cluster, a job or two
stays queued indefinitely.
I generally notice this problem when I submit 30-40 jobs all at once,
with each job requesting 1 node and 1 CPU.
Most of the jobs run just fine within a few minutes, but one or two
always "hang" and stay queued until I qrun them by hand.
Can anyone help me figure out why this keeps happening?
Every time this happens, I see this entry in the Torque log right after
my troublesome job:
08/23/2006 12:31:25;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler - port 15004
Does the above mean my PBS_Server crashed for an instant?
If there is in fact a hiccup, why doesn't my job just get run once the
system comes back online?
The example case is below.
Thanks for any help!
Dan
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
40003.tucslnxc1- TL059            nm31306                 0 Q ghts
[root@tucslnxc1-b log]# checkjob -v 40003
checking job 40003 (RM job '40003.tu¨Ô')
State: Idle EState: Deferred
Creds: user:nm31306 group:compchem class:ghts qos:medium
WallTime: 00:00:00 of 99:23:59:59
SubmitTime: Wed Aug 23 12:31:25
(Time Queued Total: 17:04:10 Eligible: 00:00:00)
StartDate: -00:00:54 Thu Aug 24 05:34:41
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
Exec: '' ExecSize: 0 ImageSize: 0
Dedicated Resources Per Task: PROCS: 1
NodeAccess: SHARED
NodeCount: 0
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 18
PartitionMask: [ALL]
SystemQueueTime: Thu Aug 24 05:34:40
Flags: RESTARTABLE
job is deferred. Reason: RMFailure (job cannot be started - cannot set hostlist)
Holds: Defer (hold reason: RMFailure)
PE: 1.00 StartPriority: 1010
cannot select job 40003 for partition DEFAULT (job hold active)
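For anyone comparing several stuck jobs, here is a rough Python sketch that pulls the fields I find most telling out of saved `checkjob -v` output: the state pair, the StartCount, and the deferral reason. The function name `summarize_checkjob` and the dictionary keys are made up for illustration, and the regular expressions assume the output looks like the listing above; other Maui versions may format it differently.

```python
import re

def summarize_checkjob(text):
    """Extract state, start count, and deferral reason from checkjob -v text.

    Assumes the layout shown in this post; keys are illustrative only.
    """
    summary = {}
    m = re.search(r"State:\s*(\S+)\s+EState:\s*(\S+)", text)
    if m:
        summary["state"], summary["estate"] = m.group(1), m.group(2)
    m = re.search(r"StartCount:\s*(\d+)", text)
    if m:
        summary["start_count"] = int(m.group(1))
    # The reason detail may be wrapped across lines, so collapse whitespace.
    m = re.search(r"job is deferred\.\s*Reason:\s*(\S+)\s*\(([^)]*)\)", text)
    if m:
        summary["defer_reason"] = m.group(1)
        summary["defer_detail"] = " ".join(m.group(2).split())
    return summary
```

On the output above, this reports a StartCount of 18 alongside the RMFailure reason, which suggests the scheduler did attempt to start the job repeatedly rather than ignoring it.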
I see this as well:
[root@tucslnxc1-b log]# ls -l /var/spool/torque/server_priv/jobs/
total 8
-rw------- 1 root root 2659 Aug 23 12:31 40003.tucsl.JB
-rw------- 1 root root 196 Aug 23 12:31 40003.tucsl.SC
Beyond the following, the Torque log shows nothing regarding this job:
08/23/2006 12:31:25;0100;PBS_Server;Job;40003.tucslnxc1-b.tuc.pharma.aventis.com;enqueuing into ghts, state 1 hop 1
08/23/2006 12:31:25;0008;PBS_Server;Job;40003.tucslnxc1-b.tuc.pharma.aventis.com;Job Queued at request of nm31306@tucslnxc1-b.tuc.pharma.aventis.com, owner = nm31306@tucslnxc1-b.tuc.pharma.aventis.com, job name = TL059, queue = ghts
08/23/2006 12:31:25;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler - port 15004
08/23/2006 12:31:26;0001;PBS_Server;Svr;PBS_Server;is_request, bad attempt to connect from 192.168.1.5:1023 (address not trusted)
08/23/2006 12:31:35;0100;PBS_Server;Req;;Type AuthenticateUser request received from nm31306@tucslnxc1-b.tuc.pharma.aventis.com, sock=10
08/23/2006 12:31:35;0100;PBS_Server;Req;;Type QueueJob request received from nm31306@tucslnxc1-b.tuc.pharma.aventis.com, sock=9
08/23/2006 12:31:35;0100;PBS_Server;Req;;Type JobScript request received from nm31306@tucslnxc1-b.tuc.pharma.aventis.com, sock=9
08/23/2006 12:31:35;0100;PBS_Server;Req;;Type ReadyToCommit request received from nm31306@tucslnxc1-b.tuc.pharma.aventis.com, sock=9
08/23/2006 12:31:35;0100;PBS_Server;Req;;Type Commit request received from nm31306@tucslnxc1-b.tuc.pharma.aventis.com, sock=9
08/23/2006 12:31:35;0100;PBS_Server;Job;40004.tucslnxc1-b.tuc.pharma.aventis.com;enqueuing into ghts, state 1 hop 1
08/23/2006 12:31:35;0008;PBS_Server;Job;40004.tucslnxc1-b.tuc.pharma.aventis.com;Job Queued at request of nm31306@tucslnxc1-b.tuc.pharma.aventis.com, owner = nm31306@tucslnxc1-b.tuc.pharma.aventis.com, job name = TL026, queue = ghts
08/23/2006 12:31:35;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler - port 15004
08/23/2006 12:31:45;0100;PBS_Server;Req;;Type AuthenticateUser request received from nm31306@tucslnxc1-b.tuc.pharma.aventis.com, sock=10
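To check how often the "Could not contact Scheduler" entry follows a job submission, something like the Python sketch below could be run over the server log. It assumes each record is one unwrapped line with six semicolon-separated fields (timestamp, event code, server name, object type, object name, message), as in the excerpt above. The names `parse_record` and `jobs_with_failed_sched_contact` are hypothetical helpers, not part of Torque.

```python
from datetime import datetime

def parse_record(line):
    """Split one pbs_server log line into its six semicolon-separated fields."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None  # wrapped or malformed line
    stamp, code, server, obj_type, obj_name, message = parts
    return {
        "time": datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"),
        "code": code,
        "type": obj_type,
        "name": obj_name,
        "message": message,
    }

def jobs_with_failed_sched_contact(lines):
    """Return job ids whose 'Job Queued' record is followed by a scheduler
    contact failure before the next job is queued."""
    flagged = []
    last_job = None
    for line in lines:
        rec = parse_record(line)
        if rec is None:
            continue
        if rec["type"] == "Job" and rec["message"].startswith("Job Queued"):
            last_job = rec["name"]
        elif "Could not contact Scheduler" in rec["message"] and last_job:
            flagged.append(last_job)
            last_job = None
    return flagged
```

Note that in the excerpt above the message follows both 40003 and 40004, so on its own this entry does not single out the job that got stuck.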
In my maui.log file I see the start priority of the job increasing
indefinitely. For example:
List,0)
08/24 05:26:41 INFO: 1 PBS jobs detected on RM tucslnxc1-b.tuc.pharma.aventis.com
08/24 05:26:41 INFO: jobs detected: 1
08/24 05:26:41 MStatClearUsage(node,Active)
08/24 05:26:41 MClusterUpdateNodeState()
08/24 05:26:41 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg)
08/24 05:26:41 INFO: job '40003' Priority: 1521
08/24 05:26:41 INFO: Cred: 1000(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 521(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
08/24 05:26:41 MStatClearUsage([NONE],Active)
08/24 05:26:41 MResDestroy(NULL)
08/24 05:26:41 INFO: total jobs selected (ALL): 0/1 [EState: 1]
08/24 05:26:41 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg)
08/24 05:26:41 INFO: job '40003' Priority: 1521
08/24 05:26:41 INFO: Cred: 1000(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 521(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
08/24 05:26:41 MStatClearUsage([NONE],Idle)
08/24 05:26:41 MResDestroy(NULL)
08/24 05:26:41 INFO: total jobs selected (ALL): 0/1 [EState: 1]
<SNIP>
08/24 05:27:31 INFO: job '40003' Priority: 1530
08/24 05:27:31 INFO: Cred: 1000(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 530(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
08/24 05:27:31 MStatClearUsage([NONE],Active)
08/24 05:27:31 MResDestroy(NULL)
08/24 05:27:31 INFO: total jobs selected (ALL): 0/1 [EState: 1]
08/24 05:27:31 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg)
08/24 05:27:31 INFO: job '40003' Priority: 1530
08/24 05:27:31 INFO: Cred: 1000(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 530(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
08/24 05:27:31 MStatClearUsage([NONE],Idle)
08/24 05:27:31 MResDestroy(NULL)
08/24 05:27:31 INFO: total jobs selected (ALL): 0/1 [EState: 1]
08/24 05:27:31 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,
and so on.
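To see how the priority climbs over time without reading the whole file, a small Python sketch like the one below could collect the distinct priority values per job from maui.log. It assumes each "job '...' Priority:" record sits on a single line with the "MM/DD HH:MM:SS" timestamp shown above. The function name `priority_history` is hypothetical.

```python
import re

# Matches lines such as: 08/24 05:26:41 INFO: job '40003' Priority: 1521
PRIORITY_RE = re.compile(
    r"^(\d\d/\d\d \d\d:\d\d:\d\d)\s+INFO:\s+job '([^']+)'\s+Priority:\s+(\d+)"
)

def priority_history(lines):
    """Map each job id to its list of (timestamp, priority) pairs,
    skipping consecutive repeats of the same pair."""
    history = {}
    for line in lines:
        m = PRIORITY_RE.match(line)
        if not m:
            continue
        stamp, job, prio = m.group(1), m.group(2), int(m.group(3))
        entries = history.setdefault(job, [])
        # The HARD and SOFT passes log the same value twice per iteration.
        if not entries or entries[-1] != (stamp, prio):
            entries.append((stamp, prio))
    return history
```

Applied to the excerpt above, this yields two entries for job 40003: priority 1521 at 05:26:41 and 1530 at 05:27:31.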