[torqueusers] jobs stuck in queue until I force execution with qrun

Christina Salls christina.salls at noaa.gov
Thu Feb 16 13:10:17 MST 2012


Hi all,

         My situation has improved, but I am still not all the way there.  I
can submit a job successfully, but it stays in the queue until I force
execution with qrun.

For example:

-bash-4.1$ qsub ./example_submit_script_1
22.admin.default.domain
-bash-4.1$ qstat -a

admin.default.domain:

                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
22.admin.default     salls    batch    ExampleJob          --      1   1    --  00:01 Q   --

[root@wings ~]# qrun 22
[root@wings ~]# qstat -a

admin.default.domain:

                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
22.admin.default     salls    batch    ExampleJob        30429     1   1    --  00:01 R   --

[root@wings ~]# qstat -a

admin.default.domain:

                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
22.admin.default     salls    batch    ExampleJob        30429     1   1    --  00:01 C 00:00
[root@wings ~]#
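
So qrun works; the scheduler just never starts the job on its own.  One
sanity check that may be worth adding here (n001 is the node the job lands
on, per the tracejob output below) is to confirm that pbs_sched actually
sees a usable node:

[root@wings ~]# pbsnodes -a

If the node reports state = down or offline, my understanding is that the
scheduler will pass over the job even though qrun can still force it onto
the node.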


This is what tracejob output looks like:

[root@wings ~]# tracejob 22
/var/spool/torque/mom_logs/20120216: No such file or directory
/var/spool/torque/sched_logs/20120216: No matching job records located

Job: 22.admin.default.domain

02/16/2012 13:46:51  S    enqueuing into batch, state 1 hop 1
02/16/2012 13:46:51  S    Job Queued at request of salls@admin.default.domain, owner = salls@admin.default.domain,
                          job name = ExampleJob, queue = batch
02/16/2012 13:46:51  A    queue=batch
02/16/2012 13:53:53  S    Job Run at request of root@admin.default.domain
02/16/2012 13:53:53  S    Not sending email: User does not want mail of this type.
02/16/2012 13:53:53  A    user=salls group=man jobname=ExampleJob queue=batch ctime=1329421611 qtime=1329421611
                          etime=1329421611 start=1329422033 owner=salls@admin.default.domain
                          exec_host=n001.default.domain/0 Resource_List.neednodes=1 Resource_List.nodect=1
                          Resource_List.nodes=1 Resource_List.walltime=00:01:00
02/16/2012 13:54:03  S    Not sending email: User does not want mail of this type.
02/16/2012 13:54:03  S    Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb
                          resources_used.walltime=00:00:10
02/16/2012 13:54:03  A    user=salls group=man jobname=ExampleJob queue=batch ctime=1329421611 qtime=1329421611
                          etime=1329421611 start=1329422033 owner=salls@admin.default.domain
                          exec_host=n001.default.domain/0 Resource_List.neednodes=1 Resource_List.nodect=1
                          Resource_List.nodes=1 Resource_List.walltime=00:01:00 session=30429 end=1329422043
                          Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb
                          resources_used.walltime=00:00:10
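
Since tracejob complains that sched_logs has no matching job records, it
might also be worth tailing today's scheduler log directly to see whether
pbs_sched is reporting any errors (the path assumes the default
/var/spool/torque server home):

[root@wings ~]# tail -n 50 /var/spool/torque/sched_logs/20120216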


This is what the output files look like:

-bash-4.1$ more ExampleJob.o22
Thu Feb 16 13:53:53 CST 2012
Thu Feb 16 13:54:03 CST 2012
-bash-4.1$ more ExampleJob.e22
-bash-4.1$
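
While a job is still sitting in the Q state, a full attribute dump might
also show something useful; some schedulers record the reason a job was
passed over in the comment attribute, though I am not sure pbs_sched does:

-bash-4.1$ qstat -f 22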

This is my basic server config:

[root@wings ~]# qmgr
Max open servers: 10239
Qmgr: print server
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = admin.default.domain
set server acl_hosts += wings.glerl.noaa.gov
set server managers = root@wings.glerl.noaa.gov
set server managers += salls@wings.glerl.noaa.gov
set server operators = root@wings.glerl.noaa.gov
set server operators += salls@wings.glerl.noaa.gov
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server keep_completed = 300
set server next_job_number = 23
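
If it helps with diagnosis: as I understand it, re-asserting the scheduling
attribute makes pbs_server kick off a scheduling cycle immediately, so you
can watch whether pbs_sched reacts to the queued job at all:

[root@wings ~]# qmgr -c "set server scheduling = true"
[root@wings ~]# qstat -a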

Processes running on server:

root     32086     1  0 13:23 ?        00:00:00 /usr/local/sbin/pbs_server -d /var/spool/torque -H admin.default.domain
root     32173     1  0 13:23 ?        00:00:00 /usr/local/sbin/pbs_sched -d /var/spool/torque
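
Both daemons are up, so presumably the next question is whether they can
talk to each other.  If I have the defaults right, pbs_server listens on
port 15001 and pbs_sched on 15004, so something like this (assuming
net-tools is installed) should show both listening:

[root@wings ~]# netstat -lntp | grep '1500'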


My sched_config file looks like this.  I left the default values as is.

[root@wings sched_priv]# more sched_config


# This is the config file for the scheduling policy
# FORMAT:  option: value prime_option
# option - the name of what we are changing defined in config.h
# value   - can be boolean/string/numeric depending on the option
# prime_option - can be prime/non_prime/all ONLY FOR SOME OPTIONS

# Round Robin -
# run a job from each queue before running second job from the
# first queue.

round_robin: False all


# By Queue -
# run jobs by queues.
#       If it is not set, the scheduler will look at all the jobs on
#       the server as one large queue, and ignore the queues set
#       by the administrator
# PRIME OPTION

by_queue: True prime
by_queue: True non_prime


# Strict Fifo -
# run jobs in strict fifo order.  If one job can not run
# move onto the next queue and do not run any more jobs
# out of that queue even if some jobs in the queue could
# be run.
# If it is not set, it could very easily starve the large
# resource using jobs.
# PRIME OPTION

strict_fifo: false ALL

#
# fair_share - schedule jobs based on usage and share values
# PRIME OPTION
#
fair_share: false ALL

# Help Starving Jobs -
# Jobs which have been waiting a long time will
# be considered starving.  Once a job is considered
# starving, the scheduler will not run any jobs
# until it can run all of the starving jobs.
# PRIME OPTION

help_starving_jobs: true ALL

#
# sort_queues - sort queues by the priority attribute
# PRIME OPTION
#
sort_queues: true ALL

#
# load_balancing - load balance between timesharing nodes
# PRIME OPTION
#
load_balancing: false ALL

# sort_by:
# key:
# to sort the jobs on one key, specify it by sort_by
# If multiple sorts are necessary, set sort_by to multi_sort
# specify the keys in order of sorting

# if round_robin or by_queue is set, the jobs will be sorted in their
# respective queues.  If not the entire server will be sorted.

# different sorts - defined in globals.c
# no_sort shortest_job_first longest_job_first smallest_memory_first
# largest_memory_first high_priority_first low_priority_first multi_sort
# fair_share large_walltime_first short_walltime_first
#
# PRIME OPTION
sort_by: shortest_job_first ALL

# filter out prolific debug messages
# 256 are DEBUG2 messages
# NO PRIME OPTION
log_filter: 256

# all queues starting with this value are dedicated time queues
# i.e. dedtime or dedicatedtime would be dedtime queues
# NO PRIME OPTION
dedicated_prefix: ded

# ignored queues
# you can specify up to 16 queues to be ignored by the scheduler
#ignore_queue: queue_name

# this defines how long before a job is considered starving.  If a job has
# been queued for this long, it will be considered starving
# NO PRIME OPTION
max_starve: 24:00:00

# The following three config values are meaningless with fair share turned off

# half_life - the half life of usage for fair share
# NO PRIME OPTION
half_life: 24:00:00

# unknown_shares - the number of shares for the "unknown" group
# NO PRIME OPTION
unknown_shares: 10

# sync_time - the amount of time between syncing the usage information to disk
# NO PRIME OPTION
sync_time: 1:00:00
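
One note in case the fix turns out to be in this file: my understanding is
that pbs_sched only re-reads sched_config at startup or on a SIGHUP, so
after editing it I would do something like:

[root@wings ~]# kill -HUP 32173    # 32173 being the pbs_sched PID from the ps output above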


Any idea what I need to do?

Thanks,

      Christina


-- 
Christina A. Salls
GLERL Computer Group
help.glerl at noaa.gov
Help Desk x2127
Christina.Salls at noaa.gov
Voice Mail 734-741-2446