[torqueusers] jobs stuck in queue until I force execution with qrun

Gustavo Correa gus at ldeo.columbia.edu
Thu Feb 16 14:05:30 MST 2012


PS - For some diagnostics, you could also try '$TORQUE/bin/pbsnodes' on the server,
and '$TORQUE/sbin/momctl -d 3'  on the compute nodes.
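
For example [assuming the default /var/spool/torque layout, as in your
pbs_server command line; adjust paths if your install differs]:

   # on the server: every node should report "state = free"
   pbsnodes -a
   # list only the problem nodes [down, offline]
   pbsnodes -l
   # on a compute node: is the mom talking to the server?
   momctl -d 3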
Gus Correa

On Feb 16, 2012, at 3:55 PM, Gustavo Correa wrote:

> Hi Christina
> 
> This is just a vague thought, not sure if in the right direction.
> 
> I am a bit confused about the domain being admin.default.domain.
> Is this the server name in $TORQUE/server_name on the head node?
> Is it something else, perhaps the head node's FQDN?
> 
> How about this line in the compute nodes' $TORQUE/mom_priv/config file:
> $pbsserver .....
> What is the server name that appears there?
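> 
> For example, something along these lines [paths assume the default
> /var/spool/torque layout, adjust if yours differs]:
> 
>    # on the head node: the name this pbs_server answers to
>    cat /var/spool/torque/server_name
>    # on each compute node: the server the mom reports to
>    grep -i pbsserver /var/spool/torque/mom_priv/config
> 
> The two should agree, and should match the name you start pbs_server
> with [your -H admin.default.domain].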
> 
> These items were a source of confusion for me long ago.
> I don't even remember anymore
> what the mistake was or how it was fixed, but maybe there is something here.
> 
> Also, is there any hint of the problem in the $TORQUE/mom_logs files in the compute nodes?
> How about the /var/log/messages on the compute nodes, any smoking gun there?
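> 
> For example [the mom log file name is just the date, YYYYMMDD]:
> 
>    # on a compute node
>    tail -50 /var/spool/torque/mom_logs/20120216
>    grep -i pbs /var/log/messages | tail -20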
> 
> Can the compute nodes resolve the Torque server name [easy way via /etc/hosts]?
> Can the Torque server resolve the compute nodes' names [ say in /etc/hosts]?
> Is there a firewall between the server and the compute nodes?
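> 
> A few quick checks, for example:
> 
>    # on a compute node: can it resolve the server?
>    getent hosts admin.default.domain
>    # on the server: can it resolve the nodes?
>    getent hosts n001.default.domain
>    # any packet filtering in the way?
>    # [I believe Torque defaults to ports 15001-15004]
>    iptables -L -n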
> 
> Maybe the Torque Admin Guide, Ch. 1 [overview/installation/configuration] 
> and Ch. 11 [troubleshooting] can help:
> 
> http://www.adaptivecomputing.com/resources/docs/
> 
> I hope this helps,
> Gus Correa
> 
> On Feb 16, 2012, at 3:10 PM, Christina Salls wrote:
> 
>> Hi all,
>> 
>>         My situation has improved, but I am still not all the way there.  I can submit a job successfully, but it will stay in the queue until I force execution with qrun.
>> 
>> e.g.
>> 
>> -bash-4.1$ qsub ./example_submit_script_1
>> 22.admin.default.domain
>> -bash-4.1$ qstat -a
>> 
>> admin.default.domain: 
>>                                                                         Req'd  Req'd   Elap
>> Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
>> -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
>> 22.admin.default     salls    batch    ExampleJob          --      1   1    --  00:01 Q   -- 
>> 
>> [root at wings ~]# qrun 22
>> [root at wings ~]# qstat -a
>> 
>> admin.default.domain: 
>>                                                                         Req'd  Req'd   Elap
>> Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
>> -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
>> 22.admin.default     salls    batch    ExampleJob        30429     1   1    --  00:01 R   -- 
>> 
>> [root at wings ~]# qstat -a
>> 
>> admin.default.domain: 
>>                                                                         Req'd  Req'd   Elap
>> Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
>> -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
>> 22.admin.default     salls    batch    ExampleJob        30429     1   1    --  00:01 C 00:00
>> [root at wings ~]# 
>> 
>> 
>> This is what tracejob output looks like:
>> 
>> [root at wings ~]# tracejob 22
>> /var/spool/torque/mom_logs/20120216: No such file or directory
>> /var/spool/torque/sched_logs/20120216: No matching job records located
>> 
>> Job: 22.admin.default.domain
>> 
>> 02/16/2012 13:46:51  S    enqueuing into batch, state 1 hop 1
>> 02/16/2012 13:46:51  S    Job Queued at request of salls at admin.default.domain, owner = salls at admin.default.domain,
>>                          job name = ExampleJob, queue = batch
>> 02/16/2012 13:46:51  A    queue=batch
>> 02/16/2012 13:53:53  S    Job Run at request of root at admin.default.domain
>> 02/16/2012 13:53:53  S    Not sending email: User does not want mail of this type.
>> 02/16/2012 13:53:53  A    user=salls group=man jobname=ExampleJob queue=batch ctime=1329421611 qtime=1329421611
>>                          etime=1329421611 start=1329422033 owner=salls at admin.default.domain
>>                          exec_host=n001.default.domain/0 Resource_List.neednodes=1 Resource_List.nodect=1
>>                          Resource_List.nodes=1 Resource_List.walltime=00:01:00 
>> 02/16/2012 13:54:03  S    Not sending email: User does not want mail of this type.
>> 02/16/2012 13:54:03  S    Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb
>>                          resources_used.walltime=00:00:10
>> 02/16/2012 13:54:03  A    user=salls group=man jobname=ExampleJob queue=batch ctime=1329421611 qtime=1329421611
>>                          etime=1329421611 start=1329422033 owner=salls at admin.default.domain
>>                          exec_host=n001.default.domain/0 Resource_List.neednodes=1 Resource_List.nodect=1
>>                          Resource_List.nodes=1 Resource_List.walltime=00:01:00 session=30429 end=1329422043
>>                          Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb
>>                          resources_used.walltime=00:00:10
>> 
>> 
>> This is what the output files look like:
>> 
>> -bash-4.1$ more ExampleJob.o22
>> Thu Feb 16 13:53:53 CST 2012
>> Thu Feb 16 13:54:03 CST 2012
>> -bash-4.1$ more ExampleJob.e22
>> -bash-4.1$ 
>> 
>> This is my basic server config:
>> 
>> [root at wings ~]# qmgr
>> Max open servers: 10239
>> Qmgr: print server
>> #
>> # Create queues and set their attributes.
>> #
>> #
>> # Create and define queue batch
>> #
>> create queue batch
>> set queue batch queue_type = Execution
>> set queue batch resources_default.nodes = 1
>> set queue batch resources_default.walltime = 01:00:00
>> set queue batch enabled = True
>> set queue batch started = True
>> #
>> # Set server attributes.
>> #
>> set server scheduling = True
>> set server acl_hosts = admin.default.domain
>> set server acl_hosts += wings.glerl.noaa.gov
>> set server managers = root at wings.glerl.noaa.gov
>> set server managers += salls at wings.glerl.noaa.gov
>> set server operators = root at wings.glerl.noaa.gov
>> set server operators += salls at wings.glerl.noaa.gov
>> set server default_queue = batch
>> set server log_events = 511
>> set server mail_from = adm
>> set server scheduler_iteration = 600
>> set server node_check_rate = 150
>> set server tcp_timeout = 6
>> set server mom_job_sync = True
>> set server keep_completed = 300
>> set server next_job_number = 23
>> 
>> Processes running on the server:
>> 
>> root     32086     1  0 13:23 ?        00:00:00 /usr/local/sbin/pbs_server -d /var/spool/torque -H admin.default.domain
>> root     32173     1  0 13:23 ?        00:00:00 /usr/local/sbin/pbs_sched -d /var/spool/torque
>> 
>> 
>> My sched_config file looks like this.  I left the default values as is.
>> 
>> [root at wings sched_priv]# more sched_config
>> 
>> 
>> # This is the config file for the scheduling policy
>> # FORMAT:  option: value prime_option
>> #	option 		- the name of what we are changing defined in config.h
>> #	value  		- can be boolean/string/numeric depending on the option
>> #	prime_option	- can be prime/non_prime/all ONLY FOR SOME OPTIONS
>> 
>> # Round Robin - 
>> #	run a job from each queue before running second job from the
>> #	first queue.
>> 
>> round_robin: False	all
>> 
>> 
>> # By Queue - 
>> #	run jobs by queues.
>> #       If it is not set, the scheduler will look at all the jobs
>> #       on the server as one large queue, and ignore the queues set
>> #       by the administrator
>> #	PRIME OPTION
>> 
>> by_queue: True		prime
>> by_queue: True		non_prime
>> 
>> 
>> # Strict Fifo - 
>> #	run jobs in strict fifo order.  If one job can not run
>> #	move onto the next queue and do not run any more jobs
>> #	out of that queue even if some jobs in the queue could
>> #	be run.
>> #	If it is not set, it could very easily starve the large
>> #	resource using jobs.
>> #	PRIME OPTION
>> 
>> strict_fifo: false	ALL
>> 
>> #
>> # fair_share - schedule jobs based on usage and share values
>> #	PRIME OPTION
>> #
>> fair_share: false	ALL
>> 
>> # Help Starving Jobs - 
>> #	Jobs which have been waiting a long time will
>> #	be considered starving.  Once a job is considered
>> #	starving, the scheduler will not run any jobs 
>> #	until it can run all of the starving jobs.  
>> #	PRIME OPTION
>> 
>> help_starving_jobs	true	ALL
>> 
>> #
>> # sort_queues - sort queues by the priority attribute
>> #	PRIME OPTION
>> #
>> sort_queues	true	ALL
>> 
>> #
>> # load_balancing - load balance between timesharing nodes
>> #	PRIME OPTION
>> #
>> load_balancing: false	ALL
>> 
>> # sort_by:
>> # key:
>> # 	to sort the jobs on one key, specify it by sort_by
>> #	If multiple sorts are necessary, set sort_by to multi_sort
>> # 	specify the keys in order of sorting
>> 
>> # if round_robin or by_queue is set, the jobs will be sorted in their
>> # respective queues.  If not the entire server will be sorted.
>> 
>> # different sorts - defined in globals.c
>> # no_sort shortest_job_first longest_job_first smallest_memory_first 
>> # largest_memory_first high_priority_first low_priority_first multi_sort
>> # fair_share large_walltime_first short_walltime_first
>> #
>> #	PRIME OPTION
>> sort_by: shortest_job_first	ALL
>> 
>> # filter out prolific debug messages
>> # 256 are DEBUG2 messages 
>> #	NO PRIME OPTION
>> log_filter: 256
>> 
>> # all queues starting with this value are dedicated time queues
>> # i.e. dedtime or dedicatedtime would be dedtime queues
>> #	NO PRIME OPTION
>> dedicated_prefix: ded
>> 
>> # ignored queues
>> # you can specify up to 16 queues to be ignored by the scheduler
>> #ignore_queue: queue_name
>> 
>> # this defines how long before a job is considered starving.  If a job has 
>> # been queued for this long, it will be considered starving
>> #	NO PRIME OPTION
>> max_starve: 24:00:00
>> 
>> # The following three config values are meaningless with fair share turned off
>> 
>> # half_life - the half life of usage for fair share
>> #	NO PRIME OPTION
>> half_life: 24:00:00
>> 
>> # unknown_shares - the number of shares for the "unknown" group
>> #	NO PRIME OPTION
>> unknown_shares: 10
>> 
>> # sync_time - the amount of time between syncing the usage information to disk
>> #	NO PRIME OPTION
>> sync_time: 1:00:00
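>> 
>> (If I change anything here, I assume I need to restart the scheduler
>> for it to take effect, e.g.:)
>> 
>>    pkill pbs_sched
>>    /usr/local/sbin/pbs_sched -d /var/spool/torque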
>> 
>> 
>> Any idea what I need to do?
>> 
>> Thanks,
>> 
>>      Christina
>> 
>> 
>> -- 
>> Christina A. Salls
>> GLERL Computer Group
>> help.glerl at noaa.gov
>> Help Desk x2127
>> Christina.Salls at noaa.gov
>> Voice Mail 734-741-2446 
>> 
>> 
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


