[torqueusers] jobs stuck in queue until I force execution with qrun
Christina Salls
christina.salls at noaa.gov
Thu Feb 16 13:10:17 MST 2012
Hi all,
My situation has improved, but I am still not there. I can submit
a job successfully, but it stays in the queue until I force execution
with qrun.
For example:
-bash-4.1$ qsub ./example_submit_script_1
22.admin.default.domain
-bash-4.1$ qstat -a
admin.default.domain:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
22.admin.default     salls    batch    ExampleJob           --     1   1     -- 00:01 Q    --
[root at wings ~]# qrun 22
[root at wings ~]# qstat -a
admin.default.domain:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
22.admin.default     salls    batch    ExampleJob        30429     1   1     -- 00:01 R    --
[root at wings ~]# qstat -a
admin.default.domain:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
22.admin.default     salls    batch    ExampleJob        30429     1   1     -- 00:01 C 00:00
[root at wings ~]#
This is what tracejob output looks like:
[root at wings ~]# tracejob 22
/var/spool/torque/mom_logs/20120216: No such file or directory
/var/spool/torque/sched_logs/20120216: No matching job records located
Job: 22.admin.default.domain
02/16/2012 13:46:51 S enqueuing into batch, state 1 hop 1
02/16/2012 13:46:51 S Job Queued at request of
salls at admin.default.domain, owner = salls at admin.default.domain,
job name = ExampleJob, queue = batch
02/16/2012 13:46:51 A queue=batch
02/16/2012 13:53:53 S Job Run at request of root at admin.default.domain
02/16/2012 13:53:53 S Not sending email: User does not want mail of
this type.
02/16/2012 13:53:53 A user=salls group=man jobname=ExampleJob
queue=batch ctime=1329421611 qtime=1329421611
etime=1329421611 start=1329422033
owner=salls at admin.default.domain
exec_host=n001.default.domain/0
Resource_List.neednodes=1 Resource_List.nodect=1
Resource_List.nodes=1
Resource_List.walltime=00:01:00
02/16/2012 13:54:03 S Not sending email: User does not want mail of
this type.
02/16/2012 13:54:03 S Exit_status=0 resources_used.cput=00:00:00
resources_used.mem=0kb resources_used.vmem=0kb
resources_used.walltime=00:00:10
02/16/2012 13:54:03 A user=salls group=man jobname=ExampleJob
queue=batch ctime=1329421611 qtime=1329421611
etime=1329421611 start=1329422033
owner=salls at admin.default.domain
exec_host=n001.default.domain/0
Resource_List.neednodes=1 Resource_List.nodect=1
Resource_List.nodes=1
Resource_List.walltime=00:01:00 session=30429 end=1329422043
Exit_status=0 resources_used.cput=00:00:00
resources_used.mem=0kb resources_used.vmem=0kb
resources_used.walltime=00:00:10
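As a cross-check on the accounting record above, the epoch fields can be turned into a queue wait and a run time. This is only a minimal sketch in Python: the helper name parse_accounting_fields is made up for illustration and is not part of TORQUE, and the record string is copied from the final "A" line of the tracejob output.

```python
# Hypothetical helper (not part of TORQUE): split the key=value fields of a
# tracejob accounting ("A") record into a dict of strings.
def parse_accounting_fields(record):
    fields = {}
    for token in record.split():
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
    return fields

# Values copied from the final "A" record for job 22 above.
record = (
    "user=salls group=man jobname=ExampleJob queue=batch "
    "ctime=1329421611 qtime=1329421611 etime=1329421611 "
    "start=1329422033 session=30429 end=1329422043 Exit_status=0"
)

fields = parse_accounting_fields(record)
queue_wait = int(fields["start"]) - int(fields["qtime"])  # seconds spent queued
run_time = int(fields["end"]) - int(fields["start"])      # seconds spent running

print("queued for %d s, ran for %d s" % (queue_wait, run_time))
```

With these values the job was queued for 422 seconds (13:46:51 to 13:53:53, when qrun was issued) and ran for 10 seconds, which agrees with resources_used.walltime=00:00:10.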
This is what the output files look like:
-bash-4.1$ more ExampleJob.o22
Thu Feb 16 13:53:53 CST 2012
Thu Feb 16 13:54:03 CST 2012
-bash-4.1$ more ExampleJob.e22
-bash-4.1$
This is my basic server config:
[root at wings ~]# qmgr
Max open servers: 10239
Qmgr: print server
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = admin.default.domain
set server acl_hosts += wings.glerl.noaa.gov
set server managers = root at wings.glerl.noaa.gov
set server managers += salls at wings.glerl.noaa.gov
set server operators = root at wings.glerl.noaa.gov
set server operators += salls at wings.glerl.noaa.gov
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server keep_completed = 300
set server next_job_number = 23
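To make the server attributes above easier to scan, the "set server" lines can be collected into a dictionary. This is a rough sketch, assuming the text is pasted from the "print server" output; parse_server_attrs is a hypothetical name, not a TORQUE tool, and only a subset of the lines is included in the sample.

```python
# Hypothetical helper: collect "set server name = value" and
# "set server name += value" lines into a dict mapping name -> list of values.
def parse_server_attrs(text):
    attrs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("set server "):
            continue  # skip comments, queue settings, blank lines
        rest = line[len("set server "):]
        if "+=" in rest:
            # "+=" appends to an attribute that already has a value
            name, value = rest.split("+=", 1)
            attrs.setdefault(name.strip(), []).append(value.strip())
        else:
            name, value = rest.split("=", 1)
            attrs[name.strip()] = [value.strip()]
    return attrs

# A subset of the lines shown above.
sample = """\
set server scheduling = True
set server acl_hosts = admin.default.domain
set server acl_hosts += wings.glerl.noaa.gov
set server default_queue = batch
set server scheduler_iteration = 600
set server keep_completed = 300
"""

attrs = parse_server_attrs(sample)
for name in ("scheduling", "scheduler_iteration", "acl_hosts"):
    print(name, "=", ", ".join(attrs[name]))
```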
Processes running on server:
root 32086 1 0 13:23 ? 00:00:00 /usr/local/sbin/pbs_server -d /var/spool/torque -H admin.default.domain
root 32173 1 0 13:23 ? 00:00:00 /usr/local/sbin/pbs_sched -d /var/spool/torque
My sched_config file looks like this. I left the default values as is.
[root at wings sched_priv]# more sched_config
# This is the config file for the scheduling policy
# FORMAT: option: value prime_option
# option - the name of what we are changing defined in config.h
# value - can be boolean/string/numeric depending on the option
# prime_option - can be prime/non_prime/all ONLY FOR SOME OPTIONS
# Round Robin -
# run a job from each queue before running second job from the
# first queue.
round_robin: False all
# By Queue -
# run jobs by queues.
# If it is not set, the scheduler will look at all the jobs on
# on the server as one large queue, and ignore the queues set
# by the administrator
# PRIME OPTION
by_queue: True prime
by_queue: True non_prime
# Strict Fifo -
# run jobs in strict fifo order. If one job can not run
# move onto the next queue and do not run any more jobs
# out of that queue even if some jobs in the queue could
# be run.
# If it is not set, it could very easily starve the large
# resource using jobs.
# PRIME OPTION
strict_fifo: false ALL
#
# fair_share - schedule jobs based on usage and share values
# PRIME OPTION
#
fair_share: false ALL
# Help Starving Jobs -
# Jobs which have been waiting a long time will
# be considered starving. Once a job is considered
# starving, the scheduler will not run any jobs
# until it can run all of the starving jobs.
# PRIME OPTION
help_starving_jobs true ALL
#
# sort_queues - sort queues by the priority attribute
# PRIME OPTION
#
sort_queues true ALL
#
# load_balancing - load balance between timesharing nodes
# PRIME OPTION
#
load_balancing: false ALL
# sort_by:
# key:
# to sort the jobs on one key, specify it by sort_by
# If multiple sorts are necessary, set sory_by to multi_sort
# specify the keys in order of sorting
# if round_robin or by_queue is set, the jobs will be sorted in their
# respective queues. If not the entire server will be sorted.
# different sorts - defined in globals.c
# no_sort shortest_job_first longest_job_first smallest_memory_first
# largest_memory_first high_priority_first low_priority_first multi_sort
# fair_share large_walltime_first short_walltime_first
#
# PRIME OPTION
sort_by: shortest_job_first ALL
# filter out prolific debug messages
# 256 are DEBUG2 messages
# NO PRIME OPTION
log_filter: 256
# all queues starting with this value are dedicated time queues
# i.e. dedtime or dedicatedtime would be dedtime queues
# NO PRIME OPTION
dedicated_prefix: ded
# ignored queues
# you can specify up to 16 queues to be ignored by the scheduler
#ignore_queue: queue_name
# this defines how long before a job is considered starving. If a job has
# been queued for this long, it will be considered starving
# NO PRIME OPTION
max_starve: 24:00:00
# The following three config values are meaningless with fair share turned off
# half_life - the half life of usage for fair share
# NO PRIME OPTION
half_life: 24:00:00
# unknown_shares - the number of shares for the "unknown" group
# NO PRIME OPTION
unknown_shares: 10
# sync_time - the amount of time between syncing the usage information to disk
# NO PRIME OPTION
sync_time: 1:00:00
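To list only the active settings from the file above, the non-comment lines can be pulled out in the "option: value prime_option" format its header describes. This is a sketch under assumptions: parse_sched_config is a hypothetical helper, and it treats the colon after the option name as optional only because the file contains both "fair_share: false ALL" and "help_starving_jobs true ALL". Whether pbs_sched itself accepts both forms is not something this sketch checks.

```python
# Hypothetical helper: read sched_config-style lines of the form
#   option: value [prime_option]
# and return (option, value, prime_option) tuples for the active settings.
def parse_sched_config(text):
    settings = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        parts = line.split()
        if len(parts) < 2:
            continue  # not an "option value" pair
        option = parts[0].rstrip(":")  # tolerate a missing colon (assumption)
        value = parts[1]
        prime = parts[2].lower() if len(parts) > 2 else None
        settings.append((option, value, prime))
    return settings

# A few lines copied from the sched_config shown above.
sample = """\
by_queue: True prime
by_queue: True non_prime
strict_fifo: false ALL
help_starving_jobs true ALL
sort_by: shortest_job_first ALL
# NO PRIME OPTION
max_starve: 24:00:00
"""

settings = parse_sched_config(sample)
for option, value, prime in settings:
    print(option, value, prime)
```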
Any idea what I need to do?
Thanks,
Christina
--
Christina A. Salls
GLERL Computer Group
help.glerl at noaa.gov
Help Desk x2127
Christina.Salls at noaa.gov
Voice Mail 734-741-2446