[Mauiusers] jobs queued for long time
Chris Berthiaume
chrisbee at u.washington.edu
Mon Oct 4 14:24:12 MDT 2010
Hello,
I have jobs that should start running immediately on available resources, but at the moment they sit in the queue for relatively long periods, anywhere from 7 minutes to over 12 hours. On the cluster in question only half of the nodes are in use by other jobs, so every new single-core job should start immediately. For example, here are the checkjob and tracejob outputs for a 10-second job I submitted as a test:
# CHECKJOB OUTPUT ###########
checking job 2066650
State: Idle
Creds: user:chrisbee group:chrisbee class:short qos:DEFAULT
WallTime: 00:00:00 of 00:00:10
SubmitTime: Mon Oct 4 12:27:11
(Time Queued Total: 00:00:01 Eligible: 00:00:01)
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [16GB]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE
PE: 1.00 StartPriority: 91
job can run in partition DEFAULT (128 procs available. 1 procs required)
##############################
# TRACEJOB OUTPUT ###########
Job: 2066650
10/04/2010 12:27:11 S enqueuing into route, state 1 hop 1
10/04/2010 12:27:11 S dequeuing from route, state QUEUED
10/04/2010 12:27:11 S enqueuing into short, state 1 hop 1
10/04/2010 12:27:11 S Job Queued at request of chrisbee at bloom,
owner = chrisbee at bloom, job name = STDIN,
queue = short
10/04/2010 12:27:11 A queue=route
10/04/2010 12:27:11 A queue=short
10/04/2010 12:33:54 S Job Modified at request of maui at bloom
10/04/2010 12:33:54 S Job Run at request of maui at bloom
10/04/2010 12:33:54 S Job Modified at request of maui at bloom
10/04/2010 12:33:54 S Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb
resources_used.vmem=0kb resources_used.walltime=00:00:00
10/04/2010 12:33:54 A user=chrisbee group=chrisbee jobname=STDIN queue=short
ctime=1286220431 qtime=1286220431 etime=1286220431
start=1286220834 owner=chrisbee at bloom
exec_host=compute-0-31/0 Resource_List.neednodes=compute-0-31
Resource_List.nodect=1 Resource_List.nodes=1
Resource_List.walltime=00:00:10
10/04/2010 12:33:54 A user=chrisbee group=chrisbee jobname=STDIN queue=short
ctime=1286220431 qtime=1286220431 etime=1286220431
start=1286220834 owner=chrisbee at bloom
exec_host=compute-0-31/0 Resource_List.neednodes=1
Resource_List.nodect=1 Resource_List.nodes=1
Resource_List.walltime=00:00:10 session=7699 end=1286220834
Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb
resources_used.vmem=0kb resources_used.walltime=00:00:00
##############################
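The checkjob output says the job can run right away (128 procs available, 1 required), but tracejob shows it was queued at 12:27:11 and not started until 12:33:54, so it waited almost 7 minutes just to start running.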
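To make the delay concrete, the wait can be computed from the accounting fields in the tracejob record. This is only a minimal sketch: it assumes `ctime` and `start` are epoch seconds for job creation and job start, and the helper name `queue_wait_seconds` is made up for illustration.

```python
import re
from datetime import timedelta

# Accounting fields copied from the tracejob output for job 2066650.
RECORD = "ctime=1286220431 qtime=1286220431 etime=1286220431 start=1286220834"


def queue_wait_seconds(record: str) -> int:
    """Return the seconds between job creation (ctime) and job start.

    Assumes both fields are present as integer epoch seconds.
    """
    fields = dict(re.findall(r"(\w+)=(\d+)", record))
    return int(fields["start"]) - int(fields["ctime"])


wait = queue_wait_seconds(RECORD)
# Print the wait both in raw seconds and as H:MM:SS.
print(wait, str(timedelta(seconds=wait)))
```

For this job the difference is 403 seconds, i.e. 6 minutes 43 seconds, which matches the gap between the 12:27:11 and 12:33:54 log lines.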
I'm using TORQUE version 2.3.6 and Maui version 3.2.6p21.
Any help in sorting out why these jobs don't start right away would be greatly appreciated.
Thanks,
Chris
--
Chris Berthiaume
Center for Environmental Genomics
University of Washington