[torqueusers] Odd job reject problem

Tim Miller btmiller at helix.nih.gov
Fri Dec 29 09:40:38 MST 2006


Hi Everyone,

I'm running Torque 2.1.4. I would like all of the nodes and desktop 
computers on our internal network to be able to submit jobs, but only 
some of them can, and I don't see why.

My setup is simple: a single routing queue (entry) feeds into a single 
execution queue (xeon). The queues are configured as follows:

routing:
Queue entry
         queue_type = Route
         total_jobs = 0
         state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
         acl_host_enable = False
         resources_default.nodes = 1:xeon306
         mtime = Fri Dec 29 11:19:27 2006
         route_destinations = xeon
         enabled = True
         started = True

exec:
Queue xeon
         queue_type = Execution
         total_jobs = 42
         state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:41 Exiting:0
         acl_host_enable = False
         from_route_only = True
         mtime = Fri Dec 29 11:19:21 2006
         resources_assigned.nodect = 58
         enabled = True
         started = True

Server setup:
Server <name removed by me>
         server_state = Active
         scheduling = True
         total_jobs = 50
         state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:50 Exiting:0
         managers = <manager list removed>
         default_queue = entry
         log_events = 511
         mail_from = adm
         query_other_jobs = True
         resources_assigned.nodect = 67
         scheduler_iteration = 600
         node_check_rate = 120
         tcp_timeout = 6
         pbs_version = 2.1.4

As you can see, I've explicitly set acl_host_enable to False on both 
queues. Nonetheless, when I try to submit a job from certain hosts, I get 
a "job rejected by all possible destinations" error and the following in 
the server log:

12/29/2006 11:20:22;0100;PBS_Server;Req;;Type AuthenticateUser request received from tim at m3.lobos.nih.gov, sock=10
12/29/2006 11:20:22;0100;PBS_Server;Req;;Type QueueJob request received from tim at m3.lobos.nih.gov, sock=9
12/29/2006 11:20:22;0100;PBS_Server;Req;;Type ReadyToCommit request received from tim at m3.lobos.nih.gov, sock=9
12/29/2006 11:20:22;0100;PBS_Server;Req;;Type Commit request received from tim at m3.lobos.nih.gov, sock=9
12/29/2006 11:20:22;0080;PBS_Server;Req;req_reject;Reject reply code=15039(Job rejected by all possible destinations), aux=0, type=Commit, from tim at m3.lobos.nih.gov
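
In case it helps anyone compare against their own server log, here is a minimal Python sketch for pulling out the reject entries. It assumes each record is a single unwrapped line with six semicolon-separated fields, as in the excerpt above. The field names, the LogRecord and Reject types, and the function names are my own labels for illustration, not anything defined by Torque.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional


class LogRecord(NamedTuple):
    # Names are illustrative labels for the six ';'-separated columns.
    timestamp: str
    event_code: str
    daemon: str
    obj_type: str
    obj_name: str
    message: str


class Reject(NamedTuple):
    timestamp: str
    code: int
    reason: str
    request_type: str
    requester: str


# Matches the message text of the req_reject line shown in the excerpt.
REJECT_RE = re.compile(
    r"Reject reply code=(\d+)\((.*)\), aux=(-?\d+), type=(\w+), from (.+)$"
)


def parse_record(line: str) -> Optional[LogRecord]:
    """Split one log line into six fields; return None if it does not fit."""
    parts = line.rstrip("\n").split(";", 5)  # the message may itself contain ';'
    if len(parts) != 6:
        return None
    return LogRecord(*parts)


def find_rejects(lines: Iterable[str]) -> Iterator[Reject]:
    """Yield one Reject for every record whose message is a reject reply."""
    for line in lines:
        record = parse_record(line)
        if record is None:
            continue
        match = REJECT_RE.search(record.message)
        if match:
            yield Reject(
                timestamp=record.timestamp,
                code=int(match.group(1)),
                reason=match.group(2),
                request_type=match.group(4),
                requester=match.group(5),
            )


# The last two records from the excerpt, joined back onto single lines.
SAMPLE_LOG = [
    "12/29/2006 11:20:22;0100;PBS_Server;Req;;Type Commit request "
    "received from tim at m3.lobos.nih.gov, sock=9",
    "12/29/2006 11:20:22;0080;PBS_Server;Req;req_reject;Reject reply "
    "code=15039(Job rejected by all possible destinations), aux=0, "
    "type=Commit, from tim at m3.lobos.nih.gov",
]

if __name__ == "__main__":
    for reject in find_rejects(SAMPLE_LOG):
        print(reject.timestamp, reject.code, reject.request_type, reject.requester)
```

To run it against a real log, pass an open file object to find_rejects instead of SAMPLE_LOG. Grouping the results by requester should show quickly which submit hosts are affected.
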

It looks like the job is never even assigned a number and is rejected 
before it ever reaches the routing queue.

I've scratched my head over this a little and just can't see what I'm 
doing wrong. Any ideas?

Thanks,
Tim

-- 
Tim Miller
Contractor / System Administrator -- Laboratory of Computational Biology
National Institutes of Health   --   Bldg. 50 Rm. 3310  --  301-402-0618

