[Mauiusers] jobs queued for long time

Chris Berthiaume chrisbee at u.washington.edu
Mon Oct 4 14:24:12 MDT 2010


Hello,

I have jobs that should start running immediately on available resources, but they currently sit in the queue for relatively long periods, anywhere from 7 minutes to over 12 hours.  On the cluster in question only half of the nodes are in use by other jobs, so any new single-core job should start immediately.  For example, here is the checkjob and tracejob output for a 10-second job I submitted as a test:


# CHECKJOB OUTPUT ###########
checking job 2066650

State: Idle
Creds:  user:chrisbee  group:chrisbee  class:short  qos:DEFAULT
WallTime: 00:00:00 of 00:00:10
SubmitTime: Mon Oct  4 12:27:11
 (Time Queued  Total: 00:00:01  Eligible: 00:00:01)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [16GB]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE

PE:  1.00  StartPriority:  91
job can run in partition DEFAULT (128 procs available.  1 procs required)
##############################

# TRACEJOB OUTPUT ###########
Job: 2066650

10/04/2010 12:27:11  S    enqueuing into route, state 1 hop 1
10/04/2010 12:27:11  S    dequeuing from route, state QUEUED
10/04/2010 12:27:11  S    enqueuing into short, state 1 hop 1
10/04/2010 12:27:11  S    Job Queued at request of chrisbee at bloom,
                         owner = chrisbee at bloom, job name = STDIN,
                         queue = short
10/04/2010 12:27:11  A    queue=route
10/04/2010 12:27:11  A    queue=short
10/04/2010 12:33:54  S    Job Modified at request of maui at bloom
10/04/2010 12:33:54  S    Job Run at request of maui at bloom
10/04/2010 12:33:54  S    Job Modified at request of maui at bloom
10/04/2010 12:33:54  S    Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb
                         resources_used.vmem=0kb resources_used.walltime=00:00:00
10/04/2010 12:33:54  A    user=chrisbee group=chrisbee jobname=STDIN queue=short
                         ctime=1286220431 qtime=1286220431 etime=1286220431
                         start=1286220834 owner=chrisbee at bloom
                         exec_host=compute-0-31/0 Resource_List.neednodes=compute-0-31
                         Resource_List.nodect=1 Resource_List.nodes=1
                         Resource_List.walltime=00:00:10 
10/04/2010 12:33:54  A    user=chrisbee group=chrisbee jobname=STDIN queue=short
                         ctime=1286220431 qtime=1286220431 etime=1286220431
                         start=1286220834 owner=chrisbee at bloom
                         exec_host=compute-0-31/0 Resource_List.neednodes=1
                         Resource_List.nodect=1 Resource_List.nodes=1
                         Resource_List.walltime=00:00:10 session=7699 end=1286220834
                         Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb
                         resources_used.vmem=0kb resources_used.walltime=00:00:00
##############################

The checkjob output suggests the job should be able to run right away (128 procs available, 1 required), but tracejob shows it was queued at 12:27:11 and not run until 12:33:54, almost 7 minutes later.
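
As a sanity check on that figure, the wait can be computed from the ctime and start fields in the accounting record above. This is only a minimal Python sketch; the variable names are my own, and the values are copied from the tracejob output for job 2066650.

```python
from datetime import timedelta

# Epoch seconds copied from the tracejob accounting record (job 2066650).
ctime = 1286220431  # ctime field: when the job was created (submitted)
start = 1286220834  # start field: when the job began running

# Time the job spent waiting in the queue before it started.
wait_seconds = start - ctime

print(wait_seconds)                     # 403
print(timedelta(seconds=wait_seconds))  # 0:06:43
```

So this 10-second job waited 403 seconds (6 minutes 43 seconds) before starting.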

I'm using TORQUE version 2.3.6 and Maui version 3.2.6p21.

Any help in sorting out why these jobs don't start right away would be greatly appreciated.

Thanks,
Chris


-- 
Chris Berthiaume
Center for Environmental Genomics
University of Washington

