[torqueusers] Odd job reject problem

Troy Baer troy at osc.edu
Fri Dec 29 09:42:45 MST 2006


On Fri, 2006-12-29 at 11:40 -0500, Tim Miller wrote:
> I'm running Torque 2.1.4. I would like all of the nodes and desktop 
> computers on our internal network to be able to submit jobs, but only 
> some of them are able to and I'm not seeing why.
> 
> My setup is simple; a single routing queue that feeds into a single 
> execution queue. The queues are configured as follows:
> 
> routing:
> Queue entry
>          queue_type = Route
>          total_jobs = 0
>          state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
>          acl_host_enable = False
>          resources_default.nodes = 1:xeon306
>          mtime = Fri Dec 29 11:19:27 2006
>          route_destinations = xeon
>          enabled = True
>          started = True
> 
> exec:
> Queue xeon
>          queue_type = Execution
>          total_jobs = 42
>          state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:41 Exiting:0
>          acl_host_enable = False
>          from_route_only = True
>          mtime = Fri Dec 29 11:19:21 2006
>          resources_assigned.nodect = 58
>          enabled = True
>          started = True
> 
> Server setup:
> Server <name removed by me>
>          server_state = Active
>          scheduling = True
>          total_jobs = 50
>          state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:50 Exiting:0
>          managers = <manager list removed>
>          default_queue = entry
>          log_events = 511
>          mail_from = adm
>          query_other_jobs = True
>          resources_assigned.nodect = 67
>          scheduler_iteration = 600
>          node_check_rate = 120
>          tcp_timeout = 6
>          pbs_version = 2.1.4
> 
> As you can see, I've explicitly set acl_host_enable to false on both 
> queues. Nonetheless, when I try to submit a job from certain hosts I get 
> a "job rejected by all possible destinations" and the following in the 
> server log:
> 
> 12/29/2006 11:20:22;0100;PBS_Server;Req;;Type AuthenticateUser request received from tim at m3.lobos.nih.gov, sock=10
> 12/29/2006 11:20:22;0100;PBS_Server;Req;;Type QueueJob request received from tim at m3.lobos.nih.gov, sock=9
> 12/29/2006 11:20:22;0100;PBS_Server;Req;;Type ReadyToCommit request received from tim at m3.lobos.nih.gov, sock=9
> 12/29/2006 11:20:22;0100;PBS_Server;Req;;Type Commit request received from tim at m3.lobos.nih.gov, sock=9
> 12/29/2006 11:20:22;0080;PBS_Server;Req;req_reject;Reject reply code=15039(Job rejected by all possible destinations), aux=0, type=Commit, from tim at m3.lobos.nih.gov
> 
> It looks like the job is never even assigned a number and is rejected 
> before it even reaches the routing queue.
> 
> I've scratched my head over this a little and just can't see what I'm 
> doing wrong. Any ideas?

What does the job look like?  It's hard to say why the job was rejected
without seeing what resources it requested.
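
In case it helps collect the relevant records, here is a minimal sketch
(plain Python, not part of Torque) that pulls the reject replies out of a
server log. The semicolon-separated field layout is only assumed from the
log lines quoted above, and the names parse_record and find_rejects are
hypothetical helpers, not anything Torque ships.

```python
import re

# Assumed record layout, inferred from the quoted server log lines:
# timestamp;event_code;daemon;object_type;object_name;message
FIELDS = ("timestamp", "event_code", "daemon",
          "object_type", "object_name", "message")

# Matches messages such as:
# "Reject reply code=15039(Job rejected ...), aux=0, type=Commit, from user@host"
REJECT_RE = re.compile(
    r"Reject reply code=(?P<code>\d+)\((?P<reason>.*?)\), "
    r"aux=(?P<aux>\d+), type=(?P<type>\w+), from (?P<sender>.+)$"
)


def parse_record(line):
    """Split one log line into named fields; return None if it does not fit."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


def find_rejects(lines):
    """Return one dict per reject reply found in an iterable of log lines."""
    rejects = []
    for line in lines:
        record = parse_record(line)
        if record is None:
            continue
        match = REJECT_RE.search(record["message"])
        if match:
            info = match.groupdict()
            info["code"] = int(info["code"])
            info["aux"] = int(info["aux"])
            info["timestamp"] = record["timestamp"]
            rejects.append(info)
    return rejects
```

Pass it the lines of the server log (for example an open file object) and
it returns the reject code, reason text, request type and sender for each
rejected request. That can be posted alongside the job script and its
resource requests.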

	--Troy
-- 
Troy Baer                       troy at osc.edu
Science & Technology Support    http://www.osc.edu/hpc/
Ohio Supercomputer Center       614-292-9701


