[torqueusers] jobs stuck in queue until I force execution with qrun
Gustavo Correa
gus at ldeo.columbia.edu
Thu Feb 16 17:51:34 MST 2012
Hi Christina
On Feb 16, 2012, at 5:04 PM, Christina Salls wrote:
>
>
> On Thu, Feb 16, 2012 at 3:55 PM, Gustavo Correa <gus at ldeo.columbia.edu> wrote:
> Hi Christina
>
> This is just a vague thought, not sure if in the right direction.
>
> I am a bit confused about the domain being admin.default.domain
> Is this the server name in $TORQUE/server_name on the head node?
>
> Yes, this is the name of the server's second interface, on the private network to the compute nodes, and it is the name in the $TORQUE/server_name file on the head node and compute nodes.
>
> [root@wings torque]# more server_name
> admin.default.domain
> [root@n001 torque]# more server_name
> admin.default.domain
>
> Is it something else, perhaps the head node FQDN Internet address?
>
>
>
> How about this line in the compute nodes' $TORQUE/mom_priv/config file:
> $pbsserver .....
> What is the server name that appears there?
>
> oh oh!! There is no /var/spool/torque/mom_priv/config file!! What should that look like?
>
Something typical:
$pbsserver name_of_server_in_the_local_subnet [probably 'admin' for you]
$usecp *:/home /home
[the second line is for shared / NFS mounted directories, to copy files with cp rather
than scp, one line per directory/filesystem]
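
As a minimal sketch, the two-line file described above could be generated with a few lines of Python. The helper name render_mom_config and the example values ('admin', '/home') are only illustrative assumptions; $pbsserver and $usecp are the two directives mentioned above.

```python
def render_mom_config(server, shared_dirs):
    """Return the text of a mom_priv/config file.

    server      -- server name as seen from the compute nodes
                   (assumed to be 'admin' in the example below)
    shared_dirs -- directories mounted at the same path on every node
    """
    lines = ["$pbsserver {}".format(server)]
    for directory in shared_dirs:
        # One $usecp line per shared directory: copy with cp instead of scp.
        lines.append("$usecp *:{0} {0}".format(directory))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical values; adjust to the actual cluster.
    print(render_mom_config("admin", ["/home"]), end="")
```

The output would go into /var/spool/torque/mom_priv/config on each compute node; pbs_mom reads that file when it starts, so the moms need a restart afterwards.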
> These items were a source of confusion for me long ago.
> I don't even remember anymore
> what was the mistake and how it was fixed, but maybe there is something here.
>
> Also, is there any hint of the problem in the $TORQUE/mom_logs files in the compute nodes?
>
> 02/16/2012 13:23:18;0002; pbs_mom;Svr;im_eof;End of File from addr 10.0.10.1:15001
> 02/16/2012 13:23:18;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server admin.default.domain
> 02/16/2012 13:23:29;0001; pbs_mom;Svr;mom_server_valid_message_source;duplicate connection from 10.0.10.1:1023 - closing original connection
> 02/16/2012 13:24:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
> 02/16/2012 13:29:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
> 02/16/2012 13:34:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
> 02/16/2012 13:39:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
> 02/16/2012 13:44:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
> 02/16/2012 13:49:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
> 02/16/2012 13:53:53;0001; pbs_mom;Job;TMomFinalizeJob3;job 22.admin.default.domain started, pid = 30429
> 02/16/2012 13:54:03;0080; pbs_mom;Job;22.admin.default.domain;scan_for_terminated: job 22.admin.default.domain task 1 terminated, sid=30429
> 02/16/2012 13:54:03;0008; pbs_mom;Job;22.admin.default.domain;job was terminated
> 02/16/2012 13:54:03;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 02/16/2012 13:54:03;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 02/16/2012 13:54:03;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 02/16/2012 13:54:03;0080; pbs_mom;Job;22.admin.default.domain;obit sent to server
> 02/16/2012 13:54:03;0080; pbs_mom;Job;22.admin.default.domain;removed job script
> 02/16/2012 13:54:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
> 02/16/2012 13:59:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
> 02/16/2012 14:04:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
> 02/16/2012 14:09:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
>
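
To pick the job-related lines out of a long mom log, a rough sketch like the one below may help. It assumes each record has six semicolon-separated fields, as in the lines above; the field names (timestamp, code, daemon, obj_class, obj_name, message) are my own labels, and the function names are hypothetical.

```python
from collections import namedtuple

# Field names are labels chosen for this sketch, not official Torque names.
LogRecord = namedtuple(
    "LogRecord", "timestamp code daemon obj_class obj_name message"
)


def parse_log_line(line):
    """Split one log line into a LogRecord, or return None if it does not fit."""
    parts = line.strip().split(";", 5)  # the message itself may contain ';'
    if len(parts) != 6:
        return None
    return LogRecord(*(part.strip() for part in parts))


def job_records(lines, job_id):
    """Return the 'Job' class records that mention job_id."""
    found = []
    for line in lines:
        record = parse_log_line(line)
        if record is None or record.obj_class != "Job":
            continue
        if job_id in record.obj_name or job_id in record.message:
            found.append(record)
    return found


if __name__ == "__main__":
    sample = [
        "02/16/2012 13:24:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0",
        "02/16/2012 13:53:53;0001; pbs_mom;Job;TMomFinalizeJob3;job 22.admin.default.domain started, pid = 30429",
        "02/16/2012 13:54:03;0080; pbs_mom;Job;22.admin.default.domain;obit sent to server",
    ]
    for record in job_records(sample, "22.admin.default.domain"):
        print(record.timestamp, record.message)
```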
> How about the /var/log/messages on the compute nodes, any smoking gun there?
>
> AHA!! This might be a clue!!
>
> Feb 16 03:06:03 n001 rpc.idmapd[2776]: nss_getpwnam: name 'root@glerl.noaa.gov' does not map into domain 'default.domain'
> Feb 16 10:14:54 n001 rpc.idmapd[2776]: nss_getpwnam: name 'root@glerl.noaa.gov' does not map into domain 'default.domain'
> Feb 16 15:49:23 n001 rpc.idmapd[2776]: nss_getpwnam: name 'root@glerl.noaa.gov' does not map into domain 'default.domain'
> [root@n001 mom_logs]#
>
>
Here I use the FQDN as the server name in $TORQUE/server_name on the head node
[but the server's local name in mom_priv/config].
My guess is that the rpc.idmapd issue above may be because you are using the server's
local name in server_name instead.
It may be worth trying the head node FQDN [wings.glerl.noaa.gov, judging from your acl_hosts]
as the server name, at least for a test: restart everything, then try again ...
> Can the compute nodes resolve the Torque server name [easy way via /etc/hosts]?
>
> yes
>
> From /etc/hosts file
>
> # Management Entries
>
> 10.0.10.1 admin.default.domain admin loghost
> 192.168.20.1 admin-ib.default.domain admin-ib loghost-ib
>
> # Ethernet Node Entries
>
> 10.0.1.1 n001.default.domain n001
> 10.0.1.2 n002.default.domain n002
> 10.0.1.3 n003.default.domain n003
> .........
> Can the Torque server resolve the compute nodes' names [ say in /etc/hosts]?
>
> yes
Sure, /etc/hosts looks right.
>
> From the /etc/hosts file on the server
>
> # Management Entries
>
> 10.0.10.1 admin.default.domain admin loghost
> 192.168.20.1 admin-ib.default.domain admin-ib loghost-ib
>
> # Ethernet Node Entries
>
> 10.0.1.1 n001.default.domain n001
> 10.0.1.2 n002.default.domain n002
> 10.0.1.3 n003.default.domain n003
> 10.0.1.4 n004.default.domain n004
>
Again, this looks right.
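
If you want to check this mechanically rather than by eye, here is a small sketch that reads hosts-format text and reports which required names are missing. The function names are hypothetical, and it only inspects the text it is given; it does not query DNS or any other name service.

```python
def parse_hosts(text):
    """Map every name and alias in hosts-format text to its IP address."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            table.setdefault(name, ip)  # keep the first entry for a name
    return table


def unresolved(table, required):
    """Return the required names that have no entry in the table."""
    return [name for name in required if name not in table]


if __name__ == "__main__":
    hosts_text = """
    # Management Entries
    10.0.10.1 admin.default.domain admin loghost
    # Ethernet Node Entries
    10.0.1.1 n001.default.domain n001
    10.0.1.2 n002.default.domain n002
    """
    table = parse_hosts(hosts_text)
    print(unresolved(table, ["admin.default.domain", "admin", "n001", "n002"]))
```

Running it against the hosts file of the server and of each compute node, with the server name and all node names as the required list, should print an empty list everywhere.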
>
> Is there a firewall between the server and the compute nodes?
>
> no firewall enabled.
>
Except between the server and the Internet, I presume. :)
[NOAA asks me for tons of RSA tokens and passwords to get to any computer there ... :) ]
> Maybe the Torque Admin Guide, Ch. 1 [overview/installation/configuration]
> and Ch. 11 [troubleshooting] can help:
>
> http://www.adaptivecomputing.com/resources/docs/
>
> I hope this helps,
>
> Thanks Gus!! I will review the Admin Guide. It is what I used to do the setup but I have been changing things right and left!
> I have also read the troubleshooting guide to no avail. Back to the drawing board.
>
That is not perfect, but it is quite useful documentation.
Gus Correa
> Gus Correa
>
> On Feb 16, 2012, at 3:10 PM, Christina Salls wrote:
>
> > Hi all,
> >
> > My situation has improved but I am still not there. I can submit a job successfully, but it will stay in the queue until I force execution with qrun.
> >
> > e.g.
> >
> > -bash-4.1$ qsub ./example_submit_script_1
> > 22.admin.default.domain
> > -bash-4.1$ qstat -a
> >
> > admin.default.domain:
> >                                                                          Req'd  Req'd   Elap
> > Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
> > -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
> > 22.admin.default     salls    batch    ExampleJob           --     1   1     -- 00:01 Q    --
> >
> > [root@wings ~]# qrun 22
> > [root@wings ~]# qstat -a
> >
> > admin.default.domain:
> >                                                                          Req'd  Req'd   Elap
> > Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
> > -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
> > 22.admin.default     salls    batch    ExampleJob        30429     1   1     -- 00:01 R    --
> >
> > [root@wings ~]# qstat -a
> >
> > admin.default.domain:
> >                                                                          Req'd  Req'd   Elap
> > Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
> > -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
> > 22.admin.default     salls    batch    ExampleJob        30429     1   1     -- 00:01 C 00:00
> > [root@wings ~]#
> >
> >
> > This is what tracejob output looks like:
> >
> > [root@wings ~]# tracejob 22
> > /var/spool/torque/mom_logs/20120216: No such file or directory
> > /var/spool/torque/sched_logs/20120216: No matching job records located
> >
> > Job: 22.admin.default.domain
> >
> > 02/16/2012 13:46:51 S enqueuing into batch, state 1 hop 1
> > 02/16/2012 13:46:51 S Job Queued at request of salls@admin.default.domain, owner = salls@admin.default.domain,
> > job name = ExampleJob, queue = batch
> > 02/16/2012 13:46:51 A queue=batch
> > 02/16/2012 13:53:53 S Job Run at request of root@admin.default.domain
> > 02/16/2012 13:53:53 S Not sending email: User does not want mail of this type.
> > 02/16/2012 13:53:53 A user=salls group=man jobname=ExampleJob queue=batch ctime=1329421611 qtime=1329421611
> > etime=1329421611 start=1329422033 owner=salls@admin.default.domain
> > exec_host=n001.default.domain/0 Resource_List.neednodes=1 Resource_List.nodect=1
> > Resource_List.nodes=1 Resource_List.walltime=00:01:00
> > 02/16/2012 13:54:03 S Not sending email: User does not want mail of this type.
> > 02/16/2012 13:54:03 S Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb
> > resources_used.walltime=00:00:10
> > 02/16/2012 13:54:03 A user=salls group=man jobname=ExampleJob queue=batch ctime=1329421611 qtime=1329421611
> > etime=1329421611 start=1329422033 owner=salls@admin.default.domain
> > exec_host=n001.default.domain/0 Resource_List.neednodes=1 Resource_List.nodect=1
> > Resource_List.nodes=1 Resource_List.walltime=00:01:00 session=30429 end=1329422043
> > Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb
> > resources_used.walltime=00:00:10
> >
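
The epoch timestamps in the accounting records above tell how long the job sat in the queue and how long it ran. A small sketch of that arithmetic, assuming the record is a whitespace-separated list of key=value tokens as shown (the function names are hypothetical):

```python
from datetime import timedelta


def parse_fields(record):
    """Turn 'key=value key=value ...' into a dict; tokens without '=' are skipped."""
    fields = {}
    for token in record.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def job_timing(record):
    """Return (time spent queued, time spent running) as timedeltas."""
    fields = parse_fields(record)
    qtime, start, end = (int(fields[key]) for key in ("qtime", "start", "end"))
    return timedelta(seconds=start - qtime), timedelta(seconds=end - start)


if __name__ == "__main__":
    # Values copied from the final accounting record of job 22 above.
    record = (
        "user=salls group=man jobname=ExampleJob queue=batch "
        "ctime=1329421611 qtime=1329421611 etime=1329421611 "
        "start=1329422033 session=30429 end=1329422043 Exit_status=0"
    )
    waited, ran = job_timing(record)
    print("queued for", waited, "ran for", ran)
```

For job 22 this gives 422 seconds in the queue (13:46:51 to 13:53:53, that is, until the qrun) and 10 seconds of run time, which agrees with resources_used.walltime=00:00:10.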
> >
> > This is what the output files look like:
> >
> > -bash-4.1$ more ExampleJob.o22
> > Thu Feb 16 13:53:53 CST 2012
> > Thu Feb 16 13:54:03 CST 2012
> > -bash-4.1$ more ExampleJob.e22
> > -bash-4.1$
> >
> > This is my basic server config:
> >
> > [root@wings ~]# qmgr
> > Max open servers: 10239
> > Qmgr: print server
> > #
> > # Create queues and set their attributes.
> > #
> > #
> > # Create and define queue batch
> > #
> > create queue batch
> > set queue batch queue_type = Execution
> > set queue batch resources_default.nodes = 1
> > set queue batch resources_default.walltime = 01:00:00
> > set queue batch enabled = True
> > set queue batch started = True
> > #
> > # Set server attributes.
> > #
> > set server scheduling = True
> > set server acl_hosts = admin.default.domain
> > set server acl_hosts += wings.glerl.noaa.gov
> > set server managers = root@wings.glerl.noaa.gov
> > set server managers += salls@wings.glerl.noaa.gov
> > set server operators = root@wings.glerl.noaa.gov
> > set server operators += salls@wings.glerl.noaa.gov
> > set server default_queue = batch
> > set server log_events = 511
> > set server mail_from = adm
> > set server scheduler_iteration = 600
> > set server node_check_rate = 150
> > set server tcp_timeout = 6
> > set server mom_job_sync = True
> > set server keep_completed = 300
> > set server next_job_number = 23
> >
> > Processes running on server:
> >
> > root 32086 1 0 13:23 ? 00:00:00 /usr/local/sbin/pbs_server -d /var/spool/torque -H admin.default.domain
> > root 32173 1 0 13:23 ? 00:00:00 /usr/local/sbin/pbs_sched -d /var/spool/torque
> >
> >
> > My sched_config file looks like this. I left the default values as is.
> >
> > [root@wings sched_priv]# more sched_config
> >
> >
> > # This is the config file for the scheduling policy
> > # FORMAT: option: value prime_option
> > # option - the name of what we are changing defined in config.h
> > # value - can be boolean/string/numeric depending on the option
> > # prime_option - can be prime/non_prime/all ONLY FOR SOME OPTIONS
> >
> > # Round Robin -
> > # run a job from each queue before running second job from the
> > # first queue.
> >
> > round_robin: False all
> >
> >
> > # By Queue -
> > # run jobs by queues.
> > # If it is not set, the scheduler will look at all the jobs on
> > # on the server as one large queue, and ignore the queues set
> > # by the administrator
> > # PRIME OPTION
> >
> > by_queue: True prime
> > by_queue: True non_prime
> >
> >
> > # Strict Fifo -
> > # run jobs in strict fifo order. If one job can not run
> > # move onto the next queue and do not run any more jobs
> > # out of that queue even if some jobs in the queue could
> > # be run.
> > # If it is not set, it could very easily starve the large
> > # resource using jobs.
> > # PRIME OPTION
> >
> > strict_fifo: false ALL
> >
> > #
> > # fair_share - schedule jobs based on usage and share values
> > # PRIME OPTION
> > #
> > fair_share: false ALL
> >
> > # Help Starving Jobs -
> > # Jobs which have been waiting a long time will
> > # be considered starving. Once a job is considered
> > # starving, the scheduler will not run any jobs
> > # until it can run all of the starving jobs.
> > # PRIME OPTION
> >
> > help_starving_jobs true ALL
> >
> > #
> > # sort_queues - sort queues by the priority attribute
> > # PRIME OPTION
> > #
> > sort_queues true ALL
> >
> > #
> > # load_balancing - load balance between timesharing nodes
> > # PRIME OPTION
> > #
> > load_balancing: false ALL
> >
> > # sort_by:
> > # key:
> > # to sort the jobs on one key, specify it by sort_by
> > # If multiple sorts are necessary, set sory_by to multi_sort
> > # specify the keys in order of sorting
> >
> > # if round_robin or by_queue is set, the jobs will be sorted in their
> > # respective queues. If not the entire server will be sorted.
> >
> > # different sorts - defined in globals.c
> > # no_sort shortest_job_first longest_job_first smallest_memory_first
> > # largest_memory_first high_priority_first low_priority_first multi_sort
> > # fair_share large_walltime_first short_walltime_first
> > #
> > # PRIME OPTION
> > sort_by: shortest_job_first ALL
> >
> > # filter out prolific debug messages
> > # 256 are DEBUG2 messages
> > # NO PRIME OPTION
> > log_filter: 256
> >
> > # all queues starting with this value are dedicated time queues
> > # i.e. dedtime or dedicatedtime would be dedtime queues
> > # NO PRIME OPTION
> > dedicated_prefix: ded
> >
> > # ignored queues
> > # you can specify up to 16 queues to be ignored by the scheduler
> > #ignore_queue: queue_name
> >
> > # this defines how long before a job is considered starving. If a job has
> > # been queued for this long, it will be considered starving
> > # NO PRIME OPTION
> > max_starve: 24:00:00
> >
> > # The following three config values are meaningless with fair share turned off
> >
> > # half_life - the half life of usage for fair share
> > # NO PRIME OPTION
> > half_life: 24:00:00
> >
> > # unknown_shares - the number of shares for the "unknown" group
> > # NO PRIME OPTION
> > unknown_shares: 10
> >
> > # sync_time - the amount of time between syncing the usage information to disk
> > # NO PRIME OPTION
> > sync_time: 1:00:00
> >
> >
> > Any idea what I need to do?
> >
> > Thanks,
> >
> > Christina
> >
> >
> > --
> > Christina A. Salls
> > GLERL Computer Group
> > help.glerl at noaa.gov
> > Help Desk x2127
> > Christina.Salls at noaa.gov
> > Voice Mail 734-741-2446
> >
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
>