[torqueusers] Job not running
Jeff Layton
laytonjb at att.net
Sun Aug 19 09:08:41 MDT 2012
André,
Apologies for the tardiness of my reply - I was away from my
system for a few weeks. I changed the log_level to 7 as
you indicated, and the server log is below. After changing
the log_level I restarted the server, fired up a compute node,
and submitted the job. I apologize for the length of the log
file (BTW - this is the server log file). Any help is greatly
appreciated.
Thanks!
Jeff
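
The server log that follows repeatedly reports "No route to host (113)" and, once, "Connection refused (111)" when pbs_server tries to reach 10.1.0.1:15003, up until the compute node starts reporting in. A minimal Python sketch for reproducing that connection attempt by hand from the server is shown here. It is not part of Torque, and the helper name `probe_tcp` is made up for illustration; the address and port in the usage note are simply the ones that appear in the log.

```python
import errno
import os
import socket


def probe_tcp(host, port, timeout=3.0):
    """Attempt a TCP connect; return (code, description). Code 0 means success."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns the C-level errno instead of raising an exception
        code = sock.connect_ex((host, port))
    name = errno.errorcode.get(code, "OK" if code == 0 else "UNKNOWN")
    text = "connected" if code == 0 else os.strerror(code)
    return code, f"{name}: {text}"
```

Calling `probe_tcp("10.1.0.1", 15003)` from the server host should return code 0 once pbs_mom is up and reachable. On Linux, a result of 113 (EHOSTUNREACH) or 111 (ECONNREFUSED) would match the two errors in the log.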
08/19/2012 10:47:21;0002;PBS_Server;Svr;Log;Log opened
08/19/2012 10:47:21;0006;PBS_Server;Svr;PBS_Server;Server test1 started,
initialization type = 1
08/19/2012 10:47:21;0002;PBS_Server;Svr;get_default_threads;Defaulting
min_threads to 9 threads
08/19/2012 10:47:21;0002;PBS_Server;Svr;Act;Account file
/opt/torque/server_priv/accounting/20120819 opened
08/19/2012 10:47:21;0040;PBS_Server;Req;setup_nodes;setup_nodes()
08/19/2012 10:47:21;0086;PBS_Server;Svr;PBS_Server;Recovered queue batch
08/19/2012 10:47:21;0002;PBS_Server;Svr;PBS_Server;Expected 1, recovered
1 queues
08/19/2012 10:47:21;0080;PBS_Server;Svr;PBS_Server;2 total files read
from disk
08/19/2012 10:47:21;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered
0 jobs
08/19/2012 10:47:21;0006;PBS_Server;Svr;PBS_Server;Using ports Server:15001 Scheduler:15004 MOM:15002 (server: 'test1')
08/19/2012 10:47:21;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid =
2128, loglevel=0
08/19/2012 10:47:33;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::No route to host (113) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 10.1.0.1:15003]
08/19/2012 10:47:33;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host n0001:15003
08/19/2012 10:47:36;0002;PBS_Server;Svr;PBS_Server;Torque Server Version
= 4.0.2, loglevel = 0
08/19/2012 10:48:03;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::No route to host (113) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 10.1.0.1:15003]
08/19/2012 10:48:03;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host n0001:15003
08/19/2012 10:48:33;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::No route to host (113) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 10.1.0.1:15003]
08/19/2012 10:48:33;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host n0001:15003
08/19/2012 10:49:03;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::No route to host (113) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 10.1.0.1:15003]
08/19/2012 10:49:03;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host n0001:15003
08/19/2012 10:52:37;0002;PBS_Server;Svr;PBS_Server;Torque Server Version
= 4.0.2, loglevel = 0
08/19/2012 10:57:37;0002;PBS_Server;Svr;PBS_Server;Torque Server Version
= 4.0.2, loglevel = 0
08/19/2012 11:02:37;0002;PBS_Server;Svr;PBS_Server;Torque Server Version
= 4.0.2, loglevel = 0
08/19/2012 11:07:48;0002;PBS_Server;Svr;PBS_Server;Torque Server Version
= 4.0.2, loglevel = 0
08/19/2012 11:12:48;0002;PBS_Server;Svr;PBS_Server;Torque Server Version
= 4.0.2, loglevel = 0
08/19/2012 11:17:48;0002;PBS_Server;Svr;PBS_Server;Torque Server Version
= 4.0.2, loglevel = 0
08/19/2012 11:20:06;0004;PBS_Server;Svr;PBS_Server;attributes set: at
request of root at test1
08/19/2012 11:20:06;0004;PBS_Server;Svr;PBS_Server;attributes set:
log_level = 7
08/19/2012 11:20:18;0002;PBS_Server;Svr;work_thread;starting work from
thread
08/19/2012 11:20:18;0002;PBS_Server;Svr;work_thread;finished work from
thread
08/19/2012 11:20:24;0086;PBS_Server;Svr;PBS_Server;Starting to shutdown
the server, type is By Signal
08/19/2012 11:20:24;0002;PBS_Server;Svr;PBS_Server;Server shutdown completed
08/19/2012 11:20:24;0002;PBS_Server;Svr;Log;Log closed
08/19/2012 11:20:24;0002;PBS_Server;Svr;Log;Log opened
08/19/2012 11:20:24;0006;PBS_Server;Svr;PBS_Server;Server test1 started,
initialization type = 1
08/19/2012 11:20:24;0002;PBS_Server;Svr;get_default_threads;Defaulting
min_threads to 9 threads
08/19/2012 11:20:24;0002;PBS_Server;Svr;Act;Account file
/opt/torque/server_priv/accounting/20120819 opened
08/19/2012 11:20:24;0040;PBS_Server;Req;setup_nodes;setup_nodes()
08/19/2012 11:20:24;0086;PBS_Server;Svr;PBS_Server;Recovered queue batch
08/19/2012 11:20:24;0002;PBS_Server;Svr;PBS_Server;Expected 1, recovered
1 queues
08/19/2012 11:20:24;0080;PBS_Server;Svr;PBS_Server;2 total files read
from disk
08/19/2012 11:20:24;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered
0 jobs
08/19/2012 11:20:24;0006;PBS_Server;Svr;PBS_Server;Using ports Server:15001 Scheduler:15004 MOM:15002 (server: 'test1')
08/19/2012 11:20:24;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid =
6383, loglevel=0
08/19/2012 11:20:25;0002;PBS_Server;Svr;work_thread;starting work from
thread
08/19/2012 11:20:25;0080;PBS_Server;node;find_nodebyname;locking start
n0001 in method find_nodebyname-no pos
08/19/2012 11:20:25;0080;PBS_Server;node;find_nodebyname;locking
complete n0001 in method find_nodebyname
08/19/2012 11:20:25;0002;PBS_Server;Svr;work_thread;starting work from
thread
08/19/2012 11:20:25;0002;PBS_Server;Svr;work_thread;finished work from
thread
08/19/2012 11:20:25;0002;PBS_Server;Svr;work_thread;starting work from
thread
08/19/2012 11:20:25;0080;PBS_Server;node;next_queue;locking batch in
method next_queue
08/19/2012 11:20:25;0002;PBS_Server;Svr;work_thread;finished work from
thread
08/19/2012 11:20:29;0002;PBS_Server;Svr;work_thread;starting work from
thread
08/19/2012 11:20:29;0002;PBS_Server;Svr;work_thread;finished work from
thread
08/19/2012 11:20:37;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::No route to host (113) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 10.1.0.1:15003]
08/19/2012 11:20:37;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host n0001:15003
08/19/2012 11:20:37;0002;PBS_Server;Svr;work_thread;finished work from
thread
08/19/2012 11:20:39;0002;PBS_Server;Svr;work_thread;starting work from
thread
08/19/2012 11:20:39;0002;PBS_Server;Svr;work_thread;finished work from
thread
08/19/2012 11:20:39;0002;PBS_Server;Svr;work_thread;starting work from
thread
08/19/2012 11:20:39;0002;PBS_Server;Svr;work_thread;starting work from
thread
08/19/2012 11:20:39;0002;PBS_Server;Svr;work_thread;finished work from
thread
08/19/2012 11:20:39;0002;PBS_Server;Svr;PBS_Server;Torque Server Version
= 4.0.2, loglevel = 7
08/19/2012 11:20:39;0002;PBS_Server;Svr;work_thread;finished work from
thread
08/19/2012 11:20:54;0002;PBS_Server;Svr;work_thread;starting work from
thread
08/19/2012 11:20:54;0002;PBS_Server;Svr;work_thread;starting work from
thread
08/19/2012 11:20:54;0080;PBS_Server;node;find_nodebyname;locking start
n0001 in method find_nodebyname-no pos
08/19/2012 11:20:54;0080;PBS_Server;node;find_nodebyname;locking
complete n0001 in method find_nodebyname
08/19/2012 11:20:54;0002;PBS_Server;Svr;work_thread;finished work from
thread
08/19/2012 11:20:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Connection refused (111) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 10.1.0.1:15003]
08/19/2012 11:20:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host n0001:15003
08/19/2012 11:20:55;0002;PBS_Server;Svr;work_thread;finished work from
thread
08/19/2012 11:21:09;0002;PBS_Server;Svr;work_thread;starting work from
thread
08/19/2012 11:21:09;0002;PBS_Server;Svr;work_thread;finished work from
thread
08/19/2012 11:21:14;0004;PBS_Server;Svr;svr_is_request;message received
from sock 8 (version 3)
08/19/2012 11:21:14;0004;PBS_Server;Svr;svr_is_request;message received
from addr 10.1.0.1:496: mom_port 15002 - rm_port 15003
08/19/2012 11:21:14;0080;PBS_Server;node;svr_is_request;locking start
n0001 in method svr_is_request-AVL_find
08/19/2012 11:21:14;0080;PBS_Server;node;svr_is_request;locking complete
n0001 in method svr_is_request
08/19/2012 11:21:14;0004;PBS_Server;Svr;svr_is_request;message STATUS
(4) received from mom on host n0001 (10.1.0.1:496) (sock 8)
08/19/2012 11:21:14;0004;PBS_Server;Svr;svr_is_request;IS_STATUS
received from n0001
08/19/2012 11:21:14;0080;PBS_Server;node;svr_is_request;unlocking n0001
in method svr_is_request-before is_stat_get
08/19/2012 11:21:14;0040;PBS_Server;Req;is_stat_get;received status from
node n0001
08/19/2012 11:21:14;0080;PBS_Server;node;find_nodebyname;locking start
n0001 in method find_nodebyname-no pos
08/19/2012 11:21:14;0080;PBS_Server;node;find_nodebyname;locking
complete n0001 in method find_nodebyname
08/19/2012 11:21:14;0040;PBS_Server;Req;update_node_state;adjusting
state for node n0001 - state=514, newstate=0
08/19/2012 11:21:14;0040;PBS_Server;Req;update_node_state;node n0001
marked free
08/19/2012 11:21:14;0080;PBS_Server;node;find_nodebyname;locking start
n0001 in method find_nodebyname-no pos
08/19/2012 11:21:14;0080;PBS_Server;node;find_nodebyname;locking
complete n0001 in method find_nodebyname
08/19/2012 11:21:14;0002;PBS_Server;Svr;work_thread;starting work from
thread
08/19/2012 11:21:14;0002;PBS_Server;Svr;work_thread;finished work from
thread
08/19/2012 11:21:14;0080;PBS_Server;node;svr_is_request;unlocking n0001
in method svr_is_request-close
08/19/2012 11:21:14;0002;PBS_Server;Svr;work_thread;starting work from
thread
08/19/2012 11:21:14;0080;PBS_Server;node;find_nodebyname;locking start
n0001 in method find_nodebyname-no pos
08/19/2012 11:21:14;0080;PBS_Server;node;find_nodebyname;locking
complete n0001 in method find_nodebyname
08/19/2012 11:21:14;0002;PBS_Server;Svr;send_hierarchy_threadtask;Successfully sent hierarchy to n0001
08/19/2012 11:21:14;0002;PBS_Server;Svr;work_thread;finished work from
thread
08/19/2012 11:21:14;0002;PBS_Server;Svr;work_thread;starting work from
thread
08/19/2012 11:21:14;0080;PBS_Server;node;next_queue;locking batch in
method next_queue
08/19/2012 11:21:14;0002;PBS_Server;Svr;work_thread;finished work from
thread
08/19/2012 11:21:24;0002;PBS_Server;Svr;work_thread;starting work from
thread
08/19/2012 11:21:24;0002;PBS_Server;Svr;work_thread;finished work from
thread
08/19/2012 11:21:29;0004;PBS_Server;Svr;svr_is_request;message received
from sock 10 (version 3)
08/19/2012 11:21:29;0004;PBS_Server;Svr;svr_is_request;message received
from addr 10.1.0.1:628: mom_port 15002 - rm_port 15003
08/19/2012 11:21:29;0080;PBS_Server;node;svr_is_request;locking start
n0001 in method svr_is_request-AVL_find
08/19/2012 11:21:29;0080;PBS_Server;node;svr_is_request;locking complete
n0001 in method svr_is_request
08/19/2012 11:21:29;0004;PBS_Server;Svr;svr_is_request;message STATUS
(4) received from mom on host n0001 (10.1.0.1:628) (sock 10)
08/19/2012 11:21:29;0004;PBS_Server;Svr;svr_is_request;IS_STATUS
received from n0001
08/19/2012 11:21:29;0080;PBS_Server;node;svr_is_request;unlocking n0001
in method svr_is_request-before is_stat_get
08/19/2012 11:21:29;0040;PBS_Server;Req;is_stat_get;received status from
node n0001
08/19/2012 11:21:29;0080;PBS_Server;node;find_nodebyname;locking start
n0001 in method find_nodebyname-no pos
08/19/2012 11:21:29;0080;PBS_Server;node;find_nodebyname;locking
complete n0001 in method find_nodebyname
08/19/2012 11:21:29;0040;PBS_Server;Req;update_node_state;adjusting
state for node n0001 - state=512, newstate=0
08/19/2012 11:21:29;0002;PBS_Server;Svr;work_thread;starting work from
thread
08/19/2012 11:21:29;0002;PBS_Server;Svr;work_thread;finished work from
thread
08/19/2012 11:21:29;0080;PBS_Server;node;find_nodebyname;locking start
n0001 in method find_nodebyname-no pos
08/19/2012 11:21:29;0080;PBS_Server;node;find_nodebyname;locking
complete n0001 in method find_nodebyname
08/19/2012 11:21:29;0080;PBS_Server;node;svr_is_request;unlocking n0001
in method svr_is_request-close
08/19/2012 11:21:39;0002;PBS_Server;Svr;work_thread;starting work from
thread
08/19/2012 11:21:39;0002;PBS_Server;Svr;work_thread;finished work from
thread
08/19/2012 11:21:46;0080;PBS_Server;Req;dis_request_read;decoding
command AuthenticateUser from laytonjb
08/19/2012 11:21:46;0100;PBS_Server;Req;;Type AuthenticateUser request
received from laytonjb at test1, sock=10
08/19/2012 11:21:46;0008;PBS_Server;Job;dispatch_request;dispatching
request AuthenticateUser on sd=10
08/19/2012 11:21:46;0008;PBS_Server;Job;reply_send_svr;Reply sent for
request type AuthenticateUser on socket 10
08/19/2012 11:21:46;0080;PBS_Server;Req;dis_request_read;decoding
command QueueJob from laytonjb
08/19/2012 11:21:46;0100;PBS_Server;Req;;Type QueueJob request received
from laytonjb at test1, sock=8
08/19/2012 11:21:46;0008;PBS_Server;Job;dispatch_request;dispatching
request QueueJob on sd=8
08/19/2012 11:21:46;0080;PBS_Server;Req;dis_request_read;decoding
command Disconnect from laytonjb
08/19/2012 11:21:46;0080;PBS_Server;node;find_queuebyname;locking batch
in method find_queuebyname
08/19/2012 11:21:46;0040;PBS_Server;Req;node_spec;entered spec=1
08/19/2012 11:21:46;0040;PBS_Server;Req;node_spec;job allocation debug:
1 requested, 3 svr_clnodes, 1 svr_totnodes
08/19/2012 11:21:46;0080;PBS_Server;node;next_node;locking start n0001
in method next_node-next != NULL
08/19/2012 11:21:46;0080;PBS_Server;node;next_node;locking complete
n0001 in method next_node
08/19/2012 11:21:46;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::gpu_count,
Counted 0 gpus available on node n0001
08/19/2012 11:21:46;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::gpu_count,
Counted 0 gpus free on node n0001
08/19/2012 11:21:46;0080;PBS_Server;node;node_spec;unlocking n0001 in
method node_spec-no pos
08/19/2012 11:21:46;0040;PBS_Server;Req;node_spec;job allocation
debug(3): returning 1 requested
08/19/2012 11:21:46;0008;PBS_Server;Job;reply_send_svr;Reply sent for
request type QueueJob on socket 8
08/19/2012 11:21:46;0080;PBS_Server;node;req_quejob;unlocking batch in
method req_quejob-success
08/19/2012 11:21:46;0080;PBS_Server;Req;dis_request_read;decoding
command JobScript from laytonjb
08/19/2012 11:21:46;0100;PBS_Server;Req;;Type JobScript request received
from laytonjb at test1, sock=8
08/19/2012 11:21:46;0008;PBS_Server;Job;dispatch_request;dispatching
request JobScript on sd=8
08/19/2012 11:21:46;0008;PBS_Server;Job;reply_send_svr;Reply sent for
request type JobScript on socket 8
08/19/2012 11:21:46;0080;PBS_Server;Req;dis_request_read;decoding
command ReadyToCommit from laytonjb
08/19/2012 11:21:46;0100;PBS_Server;Req;;Type ReadyToCommit request
received from laytonjb at test1, sock=8
08/19/2012 11:21:46;0008;PBS_Server;Job;dispatch_request;dispatching
request ReadyToCommit on sd=8
08/19/2012 11:21:46;0008;PBS_Server;Job;20.test1;ready to commit job
08/19/2012 11:21:46;0008;PBS_Server;Job;reply_send_svr;Reply sent for
request type ReadyToCommit on socket 8
08/19/2012 11:21:46;0008;PBS_Server;Job;20.test1;ready to commit job
completed
08/19/2012 11:21:46;0080;PBS_Server;Req;dis_request_read;decoding
command Commit from laytonjb
08/19/2012 11:21:46;0100;PBS_Server;Req;;Type Commit request received
from laytonjb at test1, sock=8
08/19/2012 11:21:46;0008;PBS_Server;Job;dispatch_request;dispatching
request Commit on sd=8
08/19/2012 11:21:46;0008;PBS_Server;Job;20.test1;committing job
08/19/2012 11:21:46;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 20.test1 state from TRANSIT-TRANSICM to QUEUED-QUEUED (1-10)
08/19/2012 11:21:46;0080;PBS_Server;node;find_queuebyname;locking batch
in method find_queuebyname
08/19/2012 11:21:46;0100;PBS_Server;Job;20.test1;enqueuing into batch,
state 1 hop 1
08/19/2012 11:21:46;0080;PBS_Server;node;set_resc_deflt;unlocking batch
in method set_resc_deflt-no pos
08/19/2012 11:21:46;0080;PBS_Server;node;svr_enquejob;unlocking batch in
method svr_enquejob-anything
08/19/2012 11:21:46;0080;PBS_Server;node;req_commit;unlocking batch in
method req_commit-route success
08/19/2012 11:21:46;0008;PBS_Server;Job;reply_send_svr;Reply sent for
request type Commit on socket 8
08/19/2012 11:21:46;0008;PBS_Server;Job;20.test1;Job Queued at request of laytonjb at test1, owner = laytonjb at test1, job name = pbs_test2, queue = batch
08/19/2012 11:21:46;0080;PBS_Server;Req;dis_request_read;decoding
command Disconnect from laytonjb
08/19/2012 11:21:48;0080;PBS_Server;Req;dis_request_read;decoding
command AuthenticateUser from laytonjb
08/19/2012 11:21:48;0100;PBS_Server;Req;;Type AuthenticateUser request
received from laytonjb at test1, sock=8
08/19/2012 11:21:48;0008;PBS_Server;Job;dispatch_request;dispatching
request AuthenticateUser on sd=8
08/19/2012 11:21:48;0008;PBS_Server;Job;reply_send_svr;Reply sent for
request type AuthenticateUser on socket 8
08/19/2012 11:21:48;0080;PBS_Server;Req;dis_request_read;decoding
command Disconnect from laytonjb
08/19/2012 11:21:48;0080;PBS_Server;Req;dis_request_read;decoding
command StatusServer from laytonjb
08/19/2012 11:21:48;0100;PBS_Server;Req;;Type StatusServer request
received from laytonjb at test1, sock=11
08/19/2012 11:21:48;0008;PBS_Server;Job;dispatch_request;dispatching
request StatusServer on sd=11
08/19/2012 11:21:48;0008;PBS_Server;Job;reply_send_svr;Reply sent for
request type StatusServer on socket 11
08/19/2012 11:21:48;0080;PBS_Server;Req;dis_request_read;decoding
command StatusJob from laytonjb
08/19/2012 11:21:48;0100;PBS_Server;Req;;Type StatusJob request received
from laytonjb at test1, sock=11
08/19/2012 11:21:48;0008;PBS_Server;Job;dispatch_request;dispatching
request StatusJob on sd=11
08/19/2012 11:21:48;0008;PBS_Server;Job;reply_send_svr;Reply sent for
request type StatusJob on socket 11
08/19/2012 11:21:48;0002;PBS_Server;Job;req_statjob;Successfully
returned the status of queued jobs
08/19/2012 11:21:48;0080;PBS_Server;Req;dis_request_read;decoding
command Disconnect from laytonjb
08/19/2012 11:21:49;0002;PBS_Server;Svr;work_thread;starting work from
thread
08/19/2012 11:21:49;0002;PBS_Server;Svr;work_thread;finished work from
thread
08/19/2012 11:21:54;0002;PBS_Server;Svr;work_thread;starting work from
thread
08/19/2012 11:21:54;0002;PBS_Server;Svr;work_thread;finished work from
thread
08/19/2012 11:22:04;0002;PBS_Server;Svr;work_thread;starting work from
thread
08/19/2012 11:22:04;0080;PBS_Server;node;next_queue;locking batch in
method next_queue
08/19/2012 11:22:04;0002;PBS_Server;Svr;work_thread;finished work from
thread
08/19/2012 11:22:09;0002;PBS_Server;Svr;work_thread;starting work from
thread
08/19/2012 11:22:09;0002;PBS_Server;Svr;work_thread;finished work from
thread
08/19/2012 11:22:14;0004;PBS_Server;Svr;svr_is_request;message received
from sock 10 (version 3)
08/19/2012 11:22:14;0004;PBS_Server;Svr;svr_is_request;message received
from addr 10.1.0.1:339: mom_port 15002 - rm_port 15003
08/19/2012 11:22:14;0080;PBS_Server;node;svr_is_request;locking start
n0001 in method svr_is_request-AVL_find
08/19/2012 11:22:14;0080;PBS_Server;node;svr_is_request;locking complete
n0001 in method svr_is_request
08/19/2012 11:22:14;0004;PBS_Server;Svr;svr_is_request;message STATUS
(4) received from mom on host n0001 (10.1.0.1:339) (sock 10)
08/19/2012 11:22:14;0004;PBS_Server;Svr;svr_is_request;IS_STATUS
received from n0001
08/19/2012 11:22:14;0080;PBS_Server;node;svr_is_request;unlocking n0001
in method svr_is_request-before is_stat_get
08/19/2012 11:22:14;0040;PBS_Server;Req;is_stat_get;received status from
node n0001
08/19/2012 11:22:14;0080;PBS_Server;node;find_nodebyname;locking start
n0001 in method find_nodebyname-no pos
08/19/2012 11:22:14;0080;PBS_Server;node;find_nodebyname;locking
complete n0001 in method find_nodebyname
08/19/2012 11:22:14;0040;PBS_Server;Req;update_node_state;adjusting
state for node n0001 - state=512, newstate=0
08/19/2012 11:22:14;0080;PBS_Server;node;find_nodebyname;locking start
n0001 in method find_nodebyname-no pos
08/19/2012 11:22:14;0080;PBS_Server;node;find_nodebyname;locking
complete n0001 in method find_nodebyname
08/19/2012 11:22:14;0002;PBS_Server;Svr;work_thread;starting work from
thread
08/19/2012 11:22:14;0002;PBS_Server;Svr;work_thread;finished work from
thread
08/19/2012 11:22:14;0080;PBS_Server;node;svr_is_request;unlocking n0001
in method svr_is_request-close
08/19/2012 11:22:24;0002;PBS_Server;Svr;work_thread;starting work from
thread
08/19/2012 11:22:24;0002;PBS_Server;Svr;work_thread;finished work from
thread
08/19/2012 11:22:39;0002;PBS_Server;Svr;work_thread;starting work from
thread
08/19/2012 11:22:39;0002;PBS_Server;Svr;work_thread;finished work from
thread
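
Because the mail client wrapped the log above, individual records are spread over several lines. The following Python sketch rejoins and splits them so that the LOG_ERROR entries can be pulled out. It assumes that every record begins with an MM/DD/YYYY date and has six semicolon-separated fields, as the entries above appear to. The field names (`code`, `daemon`, `klass`, `obj`) are my own labels, not official Torque terminology. Note that a line wrapped mid-word (e.g. "Faile" / "d") will be rejoined with a stray space.

```python
import re
from typing import NamedTuple

# A record starts with an MM/DD/YYYY date; anything else is a wrapped continuation.
_RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4}(\s|$)")


class LogRecord(NamedTuple):
    timestamp: str
    code: str
    daemon: str
    klass: str
    obj: str
    message: str


def join_wrapped(lines):
    """Rejoin records that were hard-wrapped across several lines."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if _RECORD_START.match(line) or not records:
            records.append(line)
        else:
            # Continuation line: glue it to the previous record with one space.
            records[-1] += " " + line
    return records


def parse_record(record):
    """Split one rejoined record into its six fields, or return None if malformed."""
    parts = record.split(";", 5)
    if len(parts) != 6:
        return None
    return LogRecord(*parts)


def find_errors(lines):
    """Return the parsed records whose message contains LOG_ERROR."""
    parsed = (parse_record(r) for r in join_wrapped(lines))
    return [p for p in parsed if p is not None and "LOG_ERROR" in p.message]
```

Passing the lines of the server log to `find_errors` should return only the connection failures and the matching send_hierarchy errors, which makes it easier to see that they stop once n0001 starts sending status.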
> Hi Jeff,
>
> please do a
>
> qmgr -c 'set server log_level = 7'
>
> and try again. Perhaps we can get some more information about the problem then.
> And please, send a qstat -f, not -a. :)
>
> Greetings
> André
>
> ----- Original Message -----
>> Gus.
>>
>> Thanks for the email! Everything is run by root and was installed
>> by root. I tried your suggestions below to add root to the server
>> manager and operators but that didn't change anything. The jobs
>> still hang and I can't find out why.
>>
>> I'm still trying some things but no joy so far. I think the problem
>> is in the scheduler but I can't seem to locate the problem. It's the
>> simple FIFO scheduler that is part of Torque so I don't see any
>> reason why it's holding jobs. The only thing I can think of is that
>> it doesn't think there are any resources available but I can't
>> find a reason why.
>>
>> Thanks!
>>
>> Jeff
>>
>>
>>