[torqueusers] Job not running

Jeff Layton laytonjb at att.net
Sun Aug 19 09:08:41 MDT 2012


  André

Apologies for the tardy reply; I was away from my
system for a few weeks. I set log_level to 7 as
you suggested, and the server log is below. I simply changed
the log_level, restarted the server, fired up a compute node,
and submitted the job. Sorry for the length of the log
(BTW, this is the server log file). Any help is greatly
appreciated.

Thanks!

Jeff



08/19/2012 10:47:21;0002;PBS_Server;Svr;Log;Log opened
08/19/2012 10:47:21;0006;PBS_Server;Svr;PBS_Server;Server test1 started, initialization type = 1
08/19/2012 10:47:21;0002;PBS_Server;Svr;get_default_threads;Defaulting min_threads to 9 threads
08/19/2012 10:47:21;0002;PBS_Server;Svr;Act;Account file /opt/torque/server_priv/accounting/20120819 opened
08/19/2012 10:47:21;0040;PBS_Server;Req;setup_nodes;setup_nodes()
08/19/2012 10:47:21;0086;PBS_Server;Svr;PBS_Server;Recovered queue batch
08/19/2012 10:47:21;0002;PBS_Server;Svr;PBS_Server;Expected 1, recovered 1 queues
08/19/2012 10:47:21;0080;PBS_Server;Svr;PBS_Server;2 total files read from disk
08/19/2012 10:47:21;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs
08/19/2012 10:47:21;0006;PBS_Server;Svr;PBS_Server;Using ports Server:15001  Scheduler:15004  MOM:15002 (server: 'test1')
08/19/2012 10:47:21;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 2128, loglevel=0
08/19/2012 10:47:33;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::No route to host (113) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 10.1.0.1:15003]
08/19/2012 10:47:33;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host n0001:15003
08/19/2012 10:47:36;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.2, loglevel = 0
08/19/2012 10:48:03;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::No route to host (113) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 10.1.0.1:15003]
08/19/2012 10:48:03;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host n0001:15003
08/19/2012 10:48:33;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::No route to host (113) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 10.1.0.1:15003]
08/19/2012 10:48:33;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host n0001:15003
08/19/2012 10:49:03;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::No route to host (113) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 10.1.0.1:15003]
08/19/2012 10:49:03;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host n0001:15003
08/19/2012 10:52:37;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.2, loglevel = 0
08/19/2012 10:57:37;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.2, loglevel = 0
08/19/2012 11:02:37;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.2, loglevel = 0
08/19/2012 11:07:48;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.2, loglevel = 0
08/19/2012 11:12:48;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.2, loglevel = 0
08/19/2012 11:17:48;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.2, loglevel = 0
08/19/2012 11:20:06;0004;PBS_Server;Svr;PBS_Server;attributes set:  at request of root at test1
08/19/2012 11:20:06;0004;PBS_Server;Svr;PBS_Server;attributes set: log_level = 7
08/19/2012 11:20:18;0002;PBS_Server;Svr;work_thread;starting work from thread
08/19/2012 11:20:18;0002;PBS_Server;Svr;work_thread;finished work from thread
08/19/2012 11:20:24;0086;PBS_Server;Svr;PBS_Server;Starting to shutdown the server, type is By Signal
08/19/2012 11:20:24;0002;PBS_Server;Svr;PBS_Server;Server shutdown completed
08/19/2012 11:20:24;0002;PBS_Server;Svr;Log;Log closed
08/19/2012 11:20:24;0002;PBS_Server;Svr;Log;Log opened
08/19/2012 11:20:24;0006;PBS_Server;Svr;PBS_Server;Server test1 started, initialization type = 1
08/19/2012 11:20:24;0002;PBS_Server;Svr;get_default_threads;Defaulting min_threads to 9 threads
08/19/2012 11:20:24;0002;PBS_Server;Svr;Act;Account file /opt/torque/server_priv/accounting/20120819 opened
08/19/2012 11:20:24;0040;PBS_Server;Req;setup_nodes;setup_nodes()
08/19/2012 11:20:24;0086;PBS_Server;Svr;PBS_Server;Recovered queue batch
08/19/2012 11:20:24;0002;PBS_Server;Svr;PBS_Server;Expected 1, recovered 1 queues
08/19/2012 11:20:24;0080;PBS_Server;Svr;PBS_Server;2 total files read from disk
08/19/2012 11:20:24;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs
08/19/2012 11:20:24;0006;PBS_Server;Svr;PBS_Server;Using ports Server:15001  Scheduler:15004  MOM:15002 (server: 'test1')
08/19/2012 11:20:24;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 6383, loglevel=0
08/19/2012 11:20:25;0002;PBS_Server;Svr;work_thread;starting work from thread
08/19/2012 11:20:25;0080;PBS_Server;node;find_nodebyname;locking start n0001 in method find_nodebyname-no pos
08/19/2012 11:20:25;0080;PBS_Server;node;find_nodebyname;locking complete n0001 in method find_nodebyname
08/19/2012 11:20:25;0002;PBS_Server;Svr;work_thread;starting work from thread
08/19/2012 11:20:25;0002;PBS_Server;Svr;work_thread;finished work from thread
08/19/2012 11:20:25;0002;PBS_Server;Svr;work_thread;starting work from thread
08/19/2012 11:20:25;0080;PBS_Server;node;next_queue;locking batch in method next_queue
08/19/2012 11:20:25;0002;PBS_Server;Svr;work_thread;finished work from thread
08/19/2012 11:20:29;0002;PBS_Server;Svr;work_thread;starting work from thread
08/19/2012 11:20:29;0002;PBS_Server;Svr;work_thread;finished work from thread
08/19/2012 11:20:37;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::No route to host (113) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 10.1.0.1:15003]
08/19/2012 11:20:37;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host n0001:15003
08/19/2012 11:20:37;0002;PBS_Server;Svr;work_thread;finished work from thread
08/19/2012 11:20:39;0002;PBS_Server;Svr;work_thread;starting work from thread
08/19/2012 11:20:39;0002;PBS_Server;Svr;work_thread;finished work from thread
08/19/2012 11:20:39;0002;PBS_Server;Svr;work_thread;starting work from thread
08/19/2012 11:20:39;0002;PBS_Server;Svr;work_thread;starting work from thread
08/19/2012 11:20:39;0002;PBS_Server;Svr;work_thread;finished work from thread
08/19/2012 11:20:39;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.2, loglevel = 7
08/19/2012 11:20:39;0002;PBS_Server;Svr;work_thread;finished work from thread
08/19/2012 11:20:54;0002;PBS_Server;Svr;work_thread;starting work from thread
08/19/2012 11:20:54;0002;PBS_Server;Svr;work_thread;starting work from thread
08/19/2012 11:20:54;0080;PBS_Server;node;find_nodebyname;locking start n0001 in method find_nodebyname-no pos
08/19/2012 11:20:54;0080;PBS_Server;node;find_nodebyname;locking complete n0001 in method find_nodebyname
08/19/2012 11:20:54;0002;PBS_Server;Svr;work_thread;finished work from thread
08/19/2012 11:20:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Connection refused (111) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 10.1.0.1:15003]
08/19/2012 11:20:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host n0001:15003
08/19/2012 11:20:55;0002;PBS_Server;Svr;work_thread;finished work from thread
08/19/2012 11:21:09;0002;PBS_Server;Svr;work_thread;starting work from thread
08/19/2012 11:21:09;0002;PBS_Server;Svr;work_thread;finished work from thread
08/19/2012 11:21:14;0004;PBS_Server;Svr;svr_is_request;message received from sock 8 (version 3)
08/19/2012 11:21:14;0004;PBS_Server;Svr;svr_is_request;message received from addr 10.1.0.1:496: mom_port 15002  - rm_port 15003
08/19/2012 11:21:14;0080;PBS_Server;node;svr_is_request;locking start n0001 in method svr_is_request-AVL_find
08/19/2012 11:21:14;0080;PBS_Server;node;svr_is_request;locking complete n0001 in method svr_is_request
08/19/2012 11:21:14;0004;PBS_Server;Svr;svr_is_request;message STATUS (4) received from mom on host n0001 (10.1.0.1:496) (sock 8)
08/19/2012 11:21:14;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received from n0001
08/19/2012 11:21:14;0080;PBS_Server;node;svr_is_request;unlocking n0001 in method svr_is_request-before is_stat_get
08/19/2012 11:21:14;0040;PBS_Server;Req;is_stat_get;received status from node n0001
08/19/2012 11:21:14;0080;PBS_Server;node;find_nodebyname;locking start n0001 in method find_nodebyname-no pos
08/19/2012 11:21:14;0080;PBS_Server;node;find_nodebyname;locking complete n0001 in method find_nodebyname
08/19/2012 11:21:14;0040;PBS_Server;Req;update_node_state;adjusting state for node n0001 - state=514, newstate=0
08/19/2012 11:21:14;0040;PBS_Server;Req;update_node_state;node n0001 marked free
08/19/2012 11:21:14;0080;PBS_Server;node;find_nodebyname;locking start n0001 in method find_nodebyname-no pos
08/19/2012 11:21:14;0080;PBS_Server;node;find_nodebyname;locking complete n0001 in method find_nodebyname
08/19/2012 11:21:14;0002;PBS_Server;Svr;work_thread;starting work from thread
08/19/2012 11:21:14;0002;PBS_Server;Svr;work_thread;finished work from thread
08/19/2012 11:21:14;0080;PBS_Server;node;svr_is_request;unlocking n0001 in method svr_is_request-close
08/19/2012 11:21:14;0002;PBS_Server;Svr;work_thread;starting work from thread
08/19/2012 11:21:14;0080;PBS_Server;node;find_nodebyname;locking start n0001 in method find_nodebyname-no pos
08/19/2012 11:21:14;0080;PBS_Server;node;find_nodebyname;locking complete n0001 in method find_nodebyname
08/19/2012 11:21:14;0002;PBS_Server;Svr;send_hierarchy_threadtask;Successfully sent hierarchy to n0001
08/19/2012 11:21:14;0002;PBS_Server;Svr;work_thread;finished work from thread
08/19/2012 11:21:14;0002;PBS_Server;Svr;work_thread;starting work from thread
08/19/2012 11:21:14;0080;PBS_Server;node;next_queue;locking batch in method next_queue
08/19/2012 11:21:14;0002;PBS_Server;Svr;work_thread;finished work from thread
08/19/2012 11:21:24;0002;PBS_Server;Svr;work_thread;starting work from thread
08/19/2012 11:21:24;0002;PBS_Server;Svr;work_thread;finished work from thread
08/19/2012 11:21:29;0004;PBS_Server;Svr;svr_is_request;message received from sock 10 (version 3)
08/19/2012 11:21:29;0004;PBS_Server;Svr;svr_is_request;message received from addr 10.1.0.1:628: mom_port 15002  - rm_port 15003
08/19/2012 11:21:29;0080;PBS_Server;node;svr_is_request;locking start n0001 in method svr_is_request-AVL_find
08/19/2012 11:21:29;0080;PBS_Server;node;svr_is_request;locking complete n0001 in method svr_is_request
08/19/2012 11:21:29;0004;PBS_Server;Svr;svr_is_request;message STATUS (4) received from mom on host n0001 (10.1.0.1:628) (sock 10)
08/19/2012 11:21:29;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received from n0001
08/19/2012 11:21:29;0080;PBS_Server;node;svr_is_request;unlocking n0001 in method svr_is_request-before is_stat_get
08/19/2012 11:21:29;0040;PBS_Server;Req;is_stat_get;received status from node n0001
08/19/2012 11:21:29;0080;PBS_Server;node;find_nodebyname;locking start n0001 in method find_nodebyname-no pos
08/19/2012 11:21:29;0080;PBS_Server;node;find_nodebyname;locking complete n0001 in method find_nodebyname
08/19/2012 11:21:29;0040;PBS_Server;Req;update_node_state;adjusting state for node n0001 - state=512, newstate=0
08/19/2012 11:21:29;0002;PBS_Server;Svr;work_thread;starting work from thread
08/19/2012 11:21:29;0002;PBS_Server;Svr;work_thread;finished work from thread
08/19/2012 11:21:29;0080;PBS_Server;node;find_nodebyname;locking start n0001 in method find_nodebyname-no pos
08/19/2012 11:21:29;0080;PBS_Server;node;find_nodebyname;locking complete n0001 in method find_nodebyname
08/19/2012 11:21:29;0080;PBS_Server;node;svr_is_request;unlocking n0001 in method svr_is_request-close
08/19/2012 11:21:39;0002;PBS_Server;Svr;work_thread;starting work from thread
08/19/2012 11:21:39;0002;PBS_Server;Svr;work_thread;finished work from thread
08/19/2012 11:21:46;0080;PBS_Server;Req;dis_request_read;decoding command AuthenticateUser from laytonjb
08/19/2012 11:21:46;0100;PBS_Server;Req;;Type AuthenticateUser request received from laytonjb at test1, sock=10
08/19/2012 11:21:46;0008;PBS_Server;Job;dispatch_request;dispatching request AuthenticateUser on sd=10
08/19/2012 11:21:46;0008;PBS_Server;Job;reply_send_svr;Reply sent for request type AuthenticateUser on socket 10
08/19/2012 11:21:46;0080;PBS_Server;Req;dis_request_read;decoding command QueueJob from laytonjb
08/19/2012 11:21:46;0100;PBS_Server;Req;;Type QueueJob request received from laytonjb at test1, sock=8
08/19/2012 11:21:46;0008;PBS_Server;Job;dispatch_request;dispatching request QueueJob on sd=8
08/19/2012 11:21:46;0080;PBS_Server;Req;dis_request_read;decoding command Disconnect from laytonjb
08/19/2012 11:21:46;0080;PBS_Server;node;find_queuebyname;locking batch in method find_queuebyname
08/19/2012 11:21:46;0040;PBS_Server;Req;node_spec;entered spec=1
08/19/2012 11:21:46;0040;PBS_Server;Req;node_spec;job allocation debug: 1 requested, 3 svr_clnodes, 1 svr_totnodes
08/19/2012 11:21:46;0080;PBS_Server;node;next_node;locking start n0001 in method next_node-next != NULL
08/19/2012 11:21:46;0080;PBS_Server;node;next_node;locking complete n0001 in method next_node
08/19/2012 11:21:46;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::gpu_count, Counted 0 gpus available on node n0001
08/19/2012 11:21:46;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::gpu_count, Counted 0 gpus free on node n0001
08/19/2012 11:21:46;0080;PBS_Server;node;node_spec;unlocking n0001 in method node_spec-no pos
08/19/2012 11:21:46;0040;PBS_Server;Req;node_spec;job allocation debug(3): returning 1 requested
08/19/2012 11:21:46;0008;PBS_Server;Job;reply_send_svr;Reply sent for request type QueueJob on socket 8
08/19/2012 11:21:46;0080;PBS_Server;node;req_quejob;unlocking batch in method req_quejob-success
08/19/2012 11:21:46;0080;PBS_Server;Req;dis_request_read;decoding command JobScript from laytonjb
08/19/2012 11:21:46;0100;PBS_Server;Req;;Type JobScript request received from laytonjb at test1, sock=8
08/19/2012 11:21:46;0008;PBS_Server;Job;dispatch_request;dispatching request JobScript on sd=8
08/19/2012 11:21:46;0008;PBS_Server;Job;reply_send_svr;Reply sent for request type JobScript on socket 8
08/19/2012 11:21:46;0080;PBS_Server;Req;dis_request_read;decoding command ReadyToCommit from laytonjb
08/19/2012 11:21:46;0100;PBS_Server;Req;;Type ReadyToCommit request received from laytonjb at test1, sock=8
08/19/2012 11:21:46;0008;PBS_Server;Job;dispatch_request;dispatching request ReadyToCommit on sd=8
08/19/2012 11:21:46;0008;PBS_Server;Job;20.test1;ready to commit job
08/19/2012 11:21:46;0008;PBS_Server;Job;reply_send_svr;Reply sent for request type ReadyToCommit on socket 8
08/19/2012 11:21:46;0008;PBS_Server;Job;20.test1;ready to commit job completed
08/19/2012 11:21:46;0080;PBS_Server;Req;dis_request_read;decoding command Commit from laytonjb
08/19/2012 11:21:46;0100;PBS_Server;Req;;Type Commit request received from laytonjb at test1, sock=8
08/19/2012 11:21:46;0008;PBS_Server;Job;dispatch_request;dispatching request Commit on sd=8
08/19/2012 11:21:46;0008;PBS_Server;Job;20.test1;committing job
08/19/2012 11:21:46;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 20.test1 state from TRANSIT-TRANSICM to QUEUED-QUEUED (1-10)
08/19/2012 11:21:46;0080;PBS_Server;node;find_queuebyname;locking batch in method find_queuebyname
08/19/2012 11:21:46;0100;PBS_Server;Job;20.test1;enqueuing into batch, state 1 hop 1
08/19/2012 11:21:46;0080;PBS_Server;node;set_resc_deflt;unlocking batch in method set_resc_deflt-no pos
08/19/2012 11:21:46;0080;PBS_Server;node;svr_enquejob;unlocking batch in method svr_enquejob-anything
08/19/2012 11:21:46;0080;PBS_Server;node;req_commit;unlocking batch in method req_commit-route success
08/19/2012 11:21:46;0008;PBS_Server;Job;reply_send_svr;Reply sent for request type Commit on socket 8
08/19/2012 11:21:46;0008;PBS_Server;Job;20.test1;Job Queued at request of laytonjb at test1, owner = laytonjb at test1, job name = pbs_test2, queue = batch
08/19/2012 11:21:46;0080;PBS_Server;Req;dis_request_read;decoding command Disconnect from laytonjb
08/19/2012 11:21:48;0080;PBS_Server;Req;dis_request_read;decoding command AuthenticateUser from laytonjb
08/19/2012 11:21:48;0100;PBS_Server;Req;;Type AuthenticateUser request received from laytonjb at test1, sock=8
08/19/2012 11:21:48;0008;PBS_Server;Job;dispatch_request;dispatching request AuthenticateUser on sd=8
08/19/2012 11:21:48;0008;PBS_Server;Job;reply_send_svr;Reply sent for request type AuthenticateUser on socket 8
08/19/2012 11:21:48;0080;PBS_Server;Req;dis_request_read;decoding command Disconnect from laytonjb
08/19/2012 11:21:48;0080;PBS_Server;Req;dis_request_read;decoding command StatusServer from laytonjb
08/19/2012 11:21:48;0100;PBS_Server;Req;;Type StatusServer request received from laytonjb at test1, sock=11
08/19/2012 11:21:48;0008;PBS_Server;Job;dispatch_request;dispatching request StatusServer on sd=11
08/19/2012 11:21:48;0008;PBS_Server;Job;reply_send_svr;Reply sent for request type StatusServer on socket 11
08/19/2012 11:21:48;0080;PBS_Server;Req;dis_request_read;decoding command StatusJob from laytonjb
08/19/2012 11:21:48;0100;PBS_Server;Req;;Type StatusJob request received from laytonjb at test1, sock=11
08/19/2012 11:21:48;0008;PBS_Server;Job;dispatch_request;dispatching request StatusJob on sd=11
08/19/2012 11:21:48;0008;PBS_Server;Job;reply_send_svr;Reply sent for request type StatusJob on socket 11
08/19/2012 11:21:48;0002;PBS_Server;Job;req_statjob;Successfully returned the status of queued jobs
08/19/2012 11:21:48;0080;PBS_Server;Req;dis_request_read;decoding command Disconnect from laytonjb
08/19/2012 11:21:49;0002;PBS_Server;Svr;work_thread;starting work from thread
08/19/2012 11:21:49;0002;PBS_Server;Svr;work_thread;finished work from thread
08/19/2012 11:21:54;0002;PBS_Server;Svr;work_thread;starting work from thread
08/19/2012 11:21:54;0002;PBS_Server;Svr;work_thread;finished work from thread
08/19/2012 11:22:04;0002;PBS_Server;Svr;work_thread;starting work from thread
08/19/2012 11:22:04;0080;PBS_Server;node;next_queue;locking batch in method next_queue
08/19/2012 11:22:04;0002;PBS_Server;Svr;work_thread;finished work from thread
08/19/2012 11:22:09;0002;PBS_Server;Svr;work_thread;starting work from thread
08/19/2012 11:22:09;0002;PBS_Server;Svr;work_thread;finished work from thread
08/19/2012 11:22:14;0004;PBS_Server;Svr;svr_is_request;message received from sock 10 (version 3)
08/19/2012 11:22:14;0004;PBS_Server;Svr;svr_is_request;message received from addr 10.1.0.1:339: mom_port 15002  - rm_port 15003
08/19/2012 11:22:14;0080;PBS_Server;node;svr_is_request;locking start n0001 in method svr_is_request-AVL_find
08/19/2012 11:22:14;0080;PBS_Server;node;svr_is_request;locking complete n0001 in method svr_is_request
08/19/2012 11:22:14;0004;PBS_Server;Svr;svr_is_request;message STATUS (4) received from mom on host n0001 (10.1.0.1:339) (sock 10)
08/19/2012 11:22:14;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received from n0001
08/19/2012 11:22:14;0080;PBS_Server;node;svr_is_request;unlocking n0001 in method svr_is_request-before is_stat_get
08/19/2012 11:22:14;0040;PBS_Server;Req;is_stat_get;received status from node n0001
08/19/2012 11:22:14;0080;PBS_Server;node;find_nodebyname;locking start n0001 in method find_nodebyname-no pos
08/19/2012 11:22:14;0080;PBS_Server;node;find_nodebyname;locking complete n0001 in method find_nodebyname
08/19/2012 11:22:14;0040;PBS_Server;Req;update_node_state;adjusting state for node n0001 - state=512, newstate=0
08/19/2012 11:22:14;0080;PBS_Server;node;find_nodebyname;locking start n0001 in method find_nodebyname-no pos
08/19/2012 11:22:14;0080;PBS_Server;node;find_nodebyname;locking complete n0001 in method find_nodebyname
08/19/2012 11:22:14;0002;PBS_Server;Svr;work_thread;starting work from thread
08/19/2012 11:22:14;0002;PBS_Server;Svr;work_thread;finished work from thread
08/19/2012 11:22:14;0080;PBS_Server;node;svr_is_request;unlocking n0001 in method svr_is_request-close
08/19/2012 11:22:24;0002;PBS_Server;Svr;work_thread;starting work from thread
08/19/2012 11:22:24;0002;PBS_Server;Svr;work_thread;finished work from thread
08/19/2012 11:22:39;0002;PBS_Server;Svr;work_thread;starting work from thread
08/19/2012 11:22:39;0002;PBS_Server;Svr;work_thread;finished work from thread
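Each record in the log above is one line of six semicolon-separated fields: timestamp, event code, daemon, object type, object name, and message (the field names below are our own labels, not Torque's). As a minimal sketch, assuming one record per line as in the raw server log file, the following Python splits each record on its first five semicolons and tallies the `LOG_ERROR::` messages, which makes the repeated connection failures to 10.1.0.1:15003 (n0001) stand out from the routine work_thread and locking chatter.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class LogRecord(NamedTuple):
    # Our own labels for the six ';'-separated columns of a server log record.
    timestamp: str
    event_code: str
    daemon: str
    object_type: str
    object_name: str
    message: str


def parse_record(line: str) -> Optional[LogRecord]:
    """Split one log line into its six fields; return None for non-records."""
    # maxsplit=5 keeps any ';' inside the free-text message intact.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return LogRecord(*parts)


def summarize_errors(lines: Iterable[str]) -> Counter:
    """Count LOG_ERROR records, keyed by the text after 'LOG_ERROR::'."""
    prefix = "LOG_ERROR::"
    counts: Counter = Counter()
    for line in lines:
        rec = parse_record(line)
        if rec is not None and rec.message.startswith(prefix):
            counts[rec.message[len(prefix):]] += 1
    return counts
```

For example, `with open(path) as fh: print(summarize_errors(fh).most_common())` lists each distinct error with its count; `path` stands for the server's log file for the day, whose location depends on the installation and is not shown in this thread.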


> Hi Jeff,
>
> please run
>
> qmgr -c 'set server log_level = 7'
>
> and try again. Perhaps that will give us more information about the problem.
> And please send the output of qstat -f, not qstat -a. :)
>
> Greetings
> André
>
> ----- Original Message -----
>> Gus,
>>
>> Thanks for the email! Everything is run by root and was installed
>> by root. I tried your suggestions below to add root to the server's
>> managers and operators, but that didn't change anything. The jobs
>> still hang and I can't find out why.
>>
>> I'm still trying some things but no joy so far. I think the problem
>> is in the scheduler, but I can't locate it. It's the
>> simple FIFO scheduler that is part of Torque, so I don't see any
>> reason why it's holding jobs. The only thing I can think of is that
>> it doesn't think there are any resources available, but I can't
>> find a reason why.
>>
>> Thanks!
>>
>> Jeff
>>
>>
>>
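One way to probe the scheduler hypothesis with the server log alone is to list which batch requests reached pbs_server and from whom. The minimal Python sketch below, assuming the `Type <Request> request received from <user> at <host>` wording seen in this log, tallies (request, user) pairs. In the log above, every such record comes from laytonjb (AuthenticateUser, QueueJob, JobScript, ReadyToCommit, Commit, StatusServer, StatusJob), and none is a RunJob request, the PBS request type used to start a job. Reading that absence as "nothing asked the server to run job 20.test1" is an inference rather than a confirmed diagnosis; the helper `saw_run_request` is hypothetical, and so is the assumption that a RunJob request would be logged in the same form.

```python
import re
from collections import Counter
from typing import Iterable

# Matches records such as
#   "...;Req;;Type QueueJob request received from laytonjb at test1, sock=8"
# The archive shows "user at host"; a raw log presumably reads "user@host",
# so both separators are accepted (an assumption about the raw format).
REQUEST_RE = re.compile(
    r"Type (?P<req>\w+) request received from "
    r"(?P<user>[^@\s]+)(?: at |@)(?P<host>[^,\s]+)"
)


def tally_requests(lines: Iterable[str]) -> Counter:
    """Count (request type, user) pairs in 'Type ... request received' records."""
    counts: Counter = Counter()
    for line in lines:
        m = REQUEST_RE.search(line)
        if m:
            counts[(m.group("req"), m.group("user"))] += 1
    return counts


def saw_run_request(counts: Counter) -> bool:
    """Hypothetical check: True if any RunJob request was tallied."""
    return any(req == "RunJob" for req, _user in counts)
```

Running `tally_requests` over a longer capture that spans at least one expected scheduling cycle would show whether any client other than the submitting user talks to the server at all.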