[torqueusers] Torque behaving badly

Jagga Soorma jagga13 at gmail.com
Thu Nov 14 20:18:18 MST 2013


I changed the log level, and the server log is below.

It looks like the server is intermittently failing to connect to port 15002
on the client.  This client was just fine under the 2.5.9 torque production
environment that we have, but it seems to be intermittently having issues in
the 2.5.13 test environment that is set up with GPU support.
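
One way to check whether the failure is at the TCP level, independent of
torque, is to repeatedly open a plain TCP connection to the mom port from the
server host.  Below is a minimal sketch of that idea in Python; the function
name probe and its parameters are my own and are not part of torque, and the
host and port must be supplied by the caller.

```python
import socket
import time


def probe(host, port, attempts=5, timeout=2.0, delay=0.0):
    """Try a plain TCP connect `attempts` times.

    Returns a list of (ok, detail) tuples, one per attempt.
    This only tests TCP reachability; it does not speak the torque protocol.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            # create_connection resolves the host and connects with a timeout
            with socket.create_connection((host, port), timeout=timeout):
                elapsed = time.monotonic() - start
                results.append((True, "connected in %.3fs" % elapsed))
        except OSError as exc:
            # Covers refused connections, timeouts and name-resolution errors
            results.append((False, "%s: %s" % (type(exc).__name__, exc)))
        time.sleep(delay)
    return results
```

For example, calling probe("node1", 15002, attempts=20, delay=1.0) on the
server and printing the results would show whether some connects fail while
others succeed.  If every plain connect succeeds while pbs_server still logs
"cannot connect", the problem is more likely above the TCP layer.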

--
server
..snip..
11/14/2013 19:15:20;0080;PBS_Server;Req;dis_request_read;decoding command
QueueJob from user
11/14/2013 19:15:20;0100;PBS_Server;Req;;Type QueueJob request received
from user at server1.xxx.com, sock=13
11/14/2013 19:15:20;0008;PBS_Server;Job;dispatch_request;dispatching
request QueueJob on sd=13
11/14/2013 19:15:20;0008;PBS_Server;Job;reply_send;Reply sent for request
type QueueJob on socket 13
11/14/2013 19:15:20;0080;PBS_Server;Req;dis_request_read;decoding command
ReadyToCommit from user
11/14/2013 19:15:20;0100;PBS_Server;Req;;Type ReadyToCommit request
received from user at server1.xxx.com, sock=13
11/14/2013 19:15:20;0008;PBS_Server;Job;dispatch_request;dispatching
request ReadyToCommit on sd=13
11/14/2013 19:15:20;0008;PBS_Server;Job;7352.server1.xxx.com;ready to
commit job
11/14/2013 19:15:20;0008;PBS_Server;Job;reply_send;Reply sent for request
type ReadyToCommit on socket 13
11/14/2013 19:15:20;0008;PBS_Server;Job;7352.server1.xxx.com;ready to
commit job completed
11/14/2013 19:15:20;0080;PBS_Server;Req;dis_request_read;decoding command
Commit from user
11/14/2013 19:15:20;0100;PBS_Server;Req;;Type Commit request received from
user at server1.xxx.com, sock=13
11/14/2013 19:15:20;0008;PBS_Server;Job;dispatch_request;dispatching
request Commit on sd=13
11/14/2013 19:15:20;0008;PBS_Server;Job;7352.server1.xxx.com;committing job
11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting
job 7352.server1.xxx.com state from TRANSIT-TRANSICM to QUEUED-QUEUED (1-10)
11/14/2013 19:15:20;0100;PBS_Server;Job;7352.server1.xxx.com;enqueuing into
batch, state 1 hop 1
11/14/2013 19:15:20;0008;PBS_Server;Job;reply_send;Reply sent for request
type Commit on socket 13
11/14/2013 19:15:20;0008;PBS_Server;Job;7352.server1.xxx.com;Reply sent for
request type Commit on socket 13
11/14/2013 19:15:20;0040;PBS_Server;Svr;server1.xxx.com;Scheduler was sent
the command new
11/14/2013 19:15:20;0080;PBS_Server;Req;dis_request_read;decoding command
StatusServer from Scheduler
11/14/2013 19:15:20;0100;PBS_Server;Req;;Type StatusServer request received
from Scheduler at server1.xxx.com, sock=14
11/14/2013 19:15:20;0008;PBS_Server;Job;dispatch_request;dispatching
request StatusServer on sd=14
11/14/2013 19:15:20;0008;PBS_Server;Job;reply_send;Reply sent for request
type StatusServer on socket 14
11/14/2013 19:15:20;0080;PBS_Server;Req;dis_request_read;decoding command
StatusNode from Scheduler
11/14/2013 19:15:20;0100;PBS_Server;Req;;Type StatusNode request received
from Scheduler at server1.xxx.com, sock=14
11/14/2013 19:15:20;0008;PBS_Server;Job;dispatch_request;dispatching
request StatusNode on sd=14
11/14/2013 19:15:20;0040;PBS_Server;Req;req_stat_node;entered
11/14/2013 19:15:20;0008;PBS_Server;Job;reply_send;Reply sent for request
type StatusNode on socket 14
11/14/2013 19:15:20;0080;PBS_Server;Req;dis_request_read;decoding command
Disconnect from user
11/14/2013 19:15:20;0080;PBS_Server;Req;dis_request_read;decoding command
StatusQueue from Scheduler
11/14/2013 19:15:20;0100;PBS_Server;Req;;Type StatusQueue request received
from Scheduler at server1.xxx.com, sock=14
11/14/2013 19:15:20;0008;PBS_Server;Job;dispatch_request;dispatching
request StatusQueue on sd=14
11/14/2013 19:15:20;0008;PBS_Server;Job;reply_send;Reply sent for request
type StatusQueue on socket 14
11/14/2013 19:15:20;0080;PBS_Server;Req;dis_request_read;decoding command
SelStat from Scheduler
11/14/2013 19:15:20;0100;PBS_Server;Req;;Type SelStat request received from
Scheduler at server1.xxx.com, sock=14
11/14/2013 19:15:20;0008;PBS_Server;Job;dispatch_request;dispatching
request SelStat on sd=14
11/14/2013 19:15:20;0008;PBS_Server;Job;reply_send;Reply sent for request
type SelStat on socket 14
11/14/2013 19:15:20;0080;PBS_Server;Req;dis_request_read;decoding command
ResourceQuery from Scheduler
11/14/2013 19:15:20;0100;PBS_Server;Req;;Type ResourceQuery request
received from Scheduler at server1.xxx.com, sock=14
11/14/2013 19:15:20;0008;PBS_Server;Job;dispatch_request;dispatching
request ResourceQuery on sd=14
11/14/2013 19:15:20;0040;PBS_Server;Req;node_spec;entered spec=1
11/14/2013 19:15:20;0040;PBS_Server;Req;node_spec;job allocation debug: 1
requested, 16 svr_clnodes, 1 svr_totnodes
11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::gpu_count,
Counted 3 gpus available on node node1
11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::gpu_count,
Counted 3 gpus free on node node1
11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::search,
search: starting eval gpus on node node1 need 0(0) mode -1 has 3 free 3
skip 0 depth 1
11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::gpu_count,
Counted 3 gpus available on node node1
11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::gpu_count,
Counted 3 gpus free on node node1
11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::search,
search: successful gpus on node node1 need 0(0) mode -1 has 3 free 3 skip 0
depth 1
11/14/2013 19:15:20;0040;PBS_Server;Req;node_spec;job allocation debug(2):
1 requested, 1 svr_numnodes
11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::gpu_count,
Counted 3 gpus free on node node1
11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::node_spec,
starting eval gpus on node node1 need 0 free 3
11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::gpu_count,
Counted 3 gpus free on node node1
11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::node_spec,
adequate virtual nodes and gpus available - node is ok
11/14/2013 19:15:20;0040;PBS_Server;Req;node_spec;job allocation debug(3):
returning 1 requested
11/14/2013 19:15:20;0008;PBS_Server;Job;reply_send;Reply sent for request
type ResourceQuery on socket 14
11/14/2013 19:15:20;0080;PBS_Server;Req;dis_request_read;decoding command
ModifyJob from Scheduler
11/14/2013 19:15:20;0100;PBS_Server;Req;;Type ModifyJob request received
from Scheduler at server1.xxx.com, sock=14
11/14/2013 19:15:20;0008;PBS_Server;Job;dispatch_request;dispatching
request ModifyJob on sd=14
11/14/2013 19:15:20;0008;PBS_Server;Job;7352.server1.xxx.com;attr comment
modified
11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting
job 7352.server1.xxx.com state from QUEUED-QUEUED to QUEUED-QUEUED (1-10)
11/14/2013 19:15:20;0008;PBS_Server;Job;7352.server1.xxx.com;Job Modified
at request of Scheduler at server1.xxx.com
11/14/2013 19:15:20;0008;PBS_Server;Job;reply_send;Reply sent for request
type ModifyJob on socket 14
11/14/2013 19:15:20;0080;PBS_Server;Req;dis_request_read;decoding command
RunJob from Scheduler
11/14/2013 19:15:20;0100;PBS_Server;Req;;Type RunJob request received from
Scheduler at server1.xxx.com, sock=14
11/14/2013 19:15:20;0008;PBS_Server;Job;dispatch_request;dispatching
request RunJob on sd=14
11/14/2013 19:15:20;0040;PBS_Server;Req;set_nodes;allocating nodes for job
7352.server1.xxx.com with node expression '1'
11/14/2013 19:15:20;0040;PBS_Server;Req;node_spec;entered spec=1
11/14/2013 19:15:20;0040;PBS_Server;Req;node_spec;job allocation debug: 1
requested, 16 svr_clnodes, 1 svr_totnodes
11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::gpu_count,
Counted 3 gpus available on node node1
11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::gpu_count,
Counted 3 gpus free on node node1
11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::search,
search: starting eval gpus on node node1 need 0(0) mode -1 has 3 free 3
skip 0 depth 1
11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::gpu_count,
Counted 3 gpus available on node node1
11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::gpu_count,
Counted 3 gpus free on node node1
11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::search,
search: successful gpus on node node1 need 0(0) mode -1 has 3 free 3 skip 0
depth 1
11/14/2013 19:15:20;0040;PBS_Server;Req;node_spec;job allocation debug(2):
1 requested, 1 svr_numnodes
11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::gpu_count,
Counted 3 gpus free on node node1
11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::node_spec,
starting eval gpus on node node1 need 0 free 3
11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::gpu_count,
Counted 3 gpus free on node node1
11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::node_spec,
adequate virtual nodes and gpus available - node is ok
11/14/2013 19:15:20;0040;PBS_Server;Req;node_spec;job allocation debug(3):
returning 1 requested
11/14/2013 19:15:20;0040;PBS_Server;Req;add_job_to_node;allocated node
node1/1 to job 7352.server1.xxx.com (nsnfree=15)
11/14/2013 19:15:20;0040;PBS_Server;Req;set_nodes;job
7352.server1.xxx.comallocated 1 nodes (nodelist=node1/1)
11/14/2013 19:15:20;0008;PBS_Server;Job;7352.server1.xxx.com;Job Run at
request of Scheduler at server1.xxx.com
11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting
job 7352.server1.xxx.com state from QUEUED-QUEUED to RUNNING-PRERUN (4-40)
11/14/2013 19:15:20;0008;PBS_Server;Job;7352.server1.xxx.com;forking in
send_job

*11/14/2013 19:15:20;0004;PBS_Server;Svr;svr_connect;attempting connect to
host 72.34.135.64 port 15002
11/14/2013 19:15:20;0004;PBS_Server;Svr;svr_connect;cannot connect to host
port 15002 - cannot establish connection () - time=0 seconds*

*11/14/2013 19:15:22;0004;PBS_Server;Svr;svr_connect;attempting connect to
host 72.34.135.64 port 15002
11/14/2013 19:15:22;0004;PBS_Server;Svr;svr_connect;cannot connect to host
port 15002 - cannot establish connection () - time=0 seconds*
11/14/2013 19:15:22;0008;PBS_Server;Job;7352.server1.xxx.com;entering
post_sendmom
11/14/2013 19:15:22;0002;PBS_Server;Job;7352.server1.xxx.com;child reported
failure for job after 2 seconds (dest=node1), rc=2
11/14/2013 19:15:22;0008;PBS_Server;Job;7352.server1.xxx.com;unable to run
job, MOM rejected/rc=2
11/14/2013 19:15:22;0040;PBS_Server;Req;free_nodes;freeing nodes for job
7352.server1.xxx.com
11/14/2013 19:15:22;0040;PBS_Server;Req;free_nodes;freeing node node1/1
from job 7352.server1.xxx.com (nsnfree=14)
11/14/2013 19:15:22;0040;PBS_Server;Req;free_nodes;increased sub-node free
count to 15 of 16
11/14/2013 19:15:22;0080;PBS_Server;Req;req_reject;Reject reply
code=15043(Execution server rejected request REJHOST=node1 MSG=cannot send
job to node1, state=PRERUN), aux=0, type=RunJob, from
Scheduler at server1.xxx.com
11/14/2013 19:15:22;0008;PBS_Server;Job;reply_send;Reply sent for request
type RunJob on socket 14
11/14/2013 19:15:22;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting
job 7352.server1.xxx.com state from RUNNING-PRERUN to QUEUED-QUEUED (1-10)
11/14/2013 19:15:22;0080;PBS_Server;Req;dis_request_read;decoding command
ModifyJob from Scheduler
11/14/2013 19:15:22;0100;PBS_Server;Req;;Type ModifyJob request received
from Scheduler at server1.xxx.com, sock=14
11/14/2013 19:15:22;0008;PBS_Server;Job;dispatch_request;dispatching
request ModifyJob on sd=14
11/14/2013 19:15:22;0008;PBS_Server;Job;7352.server1.xxx.com;attr comment
modified
11/14/2013 19:15:22;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting
job 7352.server1.xxx.com state from QUEUED-QUEUED to QUEUED-QUEUED (1-10)
11/14/2013 19:15:22;0008;PBS_Server;Job;7352.server1.xxx.com;Job Modified
at request of Scheduler at server1.xxx.com
11/14/2013 19:15:22;0008;PBS_Server;Job;reply_send;Reply sent for request
type ModifyJob on socket 14
11/14/2013 19:15:22;0040;PBS_Server;Svr;server1.xxx.com;Scheduler was sent
the command recyc
11/14/2013 19:15:27;0080;PBS_Server;Req;dis_request_read;decoding command
DeleteJob from user
11/14/2013 19:15:27;0100;PBS_Server;Req;;Type DeleteJob request received
from user at server1.xxx.com, sock=13
11/14/2013 19:15:27;0008;PBS_Server;Job;dispatch_request;dispatching
request DeleteJob on sd=13
11/14/2013 19:15:27;0008;PBS_Server;Job;7352.server1.xxx.com;Job deleted at
request of user at server1.xxx.com
11/14/2013 19:15:27;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting
job 7352.server1.xxx.com state from QUEUED-QUEUED to COMPLETE-COMPLETE
(6-59)
11/14/2013 19:15:27;0008;PBS_Server;Job;reply_send;Reply sent for request
type DeleteJob on socket 13
11/14/2013 19:15:27;0080;PBS_Server;Req;dis_request_read;decoding command
Disconnect from user

..snip..
--
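
To pull just the connection failures out of a long pasted log like the one
above, the wrapped lines first have to be joined back into records.  The
following is a rough Python sketch, assuming each record starts with a
"MM/DD/YYYY HH:MM:SS;" timestamp followed by semicolon-separated fields as
seen in this log.  The field labels (time, code, daemon, cls, obj, msg) are my
own names, and the input is assumed to contain only log lines, since any stray
text after a record is appended to that record.

```python
import re

# A record begins with a timestamp; a leading "*" (mail emphasis) is tolerated.
RECORD_START = re.compile(r"^\*?\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")

FIELDS = ("time", "code", "daemon", "cls", "obj", "msg")


def parse_records(lines):
    """Join wrapped lines into records and split each into six fields."""
    records = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if RECORD_START.match(line):
            if current is not None:
                records.append(current)
            current = line.lstrip("*")
        elif current is not None:
            # Continuation of a wrapped record
            current += " " + line
    if current is not None:
        records.append(current)

    parsed = []
    for rec in records:
        # maxsplit=5 keeps any ";" inside the message text intact
        parts = rec.split(";", 5)
        if len(parts) == 6:
            parsed.append(dict(zip(FIELDS, parts)))
    return parsed


def connect_failures(records):
    """Return only the svr_connect records that report a failed connect."""
    return [r for r in records
            if r["obj"] == "svr_connect" and "cannot connect" in r["msg"]]
```

Feeding the server log through parse_records and then connect_failures gives
the timestamps of each failed connect, which can be lined up against the mom
log on the compute node for the same seconds.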

Thanks for your time and help!

Much appreciated,
-J


On Thu, Nov 14, 2013 at 6:40 PM, David Beer <dbeer at adaptivecomputing.com> wrote:

> From the momctl output that you showed your mom's log level is at 0. I
> would change this to 10 and then look into what happens when the job is
> submitted.
>
>
> On Thu, Nov 14, 2013 at 5:48 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>
>> Anyone?  Any ideas?  I am pulling my hair out on this one and can't seem
>> to find any issues with the server or client.  Any help would be greatly
>> appreciated!
>>
>> Thanks,
>> -J
>>
>>
>> On Wed, Nov 13, 2013 at 8:11 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>
>>> I have increased the log level on pbs_server and now I am seeing the
>>> following messages:
>>>
>>> --
>>>
>>> *11/13/2013 20:09:08;0004;PBS_Server;Svr;svr_connect;attempting connect
>>> to host 72.34.135.64 port 15002
>>> 11/13/2013 20:09:08;0004;PBS_Server;Svr;svr_connect;cannot connect to
>>> host port 15002 - cannot establish connection (
>>> 11/13/2013 20:09:09;0004;PBS_Server;Svr;svr_connect;attempting connect
>>> to host 72.34.135.64 port 15002
>>> 11/13/2013 20:09:09;0004;PBS_Server;Svr;svr_connect;cannot connect to
>>> host port 15002 - cannot establish connection (
>>> 11/13/2013 20:09:11;0004;PBS_Server;Svr;svr_connect;attempting connect
>>> to host 72.34.135.64 port 15002
>>> 11/13/2013 20:09:11;0004;PBS_Server;Svr;svr_connect;cannot connect to
>>> host port 15002 - cannot establish connection (*
>>> 11/13/2013 20:09:15;0040;PBS_Server;Req;do_rpp;rpp request received on
>>> stream 0
>>> 11/13/2013 20:09:15;0040;PBS_Server;Req;do_rpp;inter-server request
>>> received
>>> 11/13/2013 20:09:15;0004;PBS_Server;Svr;is_request;message received from
>>> stream 0 (version 1)
>>> 11/13/2013 20:09:15;0004;PBS_Server;Svr;is_request;message received from
>>> stream 72.34.135.64:15003
>>> 11/13/2013 20:09:15;0004;PBS_Server;Svr;is_request;message STATUS (4)
>>> received from mom on host node1 (72.34.135.64:15003) (stream 0)
>>> 11/13/2013 20:09:15;0004;PBS_Server;Svr;is_request;IS_STATUS received
>>> from node1
>>> 11/13/2013 20:09:15;0040;PBS_Server;Req;is_stat_get;received status from
>>> node node1
>>> 11/13/2013 20:09:15;0040;PBS_Server;Req;update_node_state;adjusting
>>> state for node node1 - state=0, newstate=0
>>> 11/13/2013 20:09:15;0040;PBS_Server;Req;do_rpp;rpp request received on
>>> stream 0
>>> 11/13/2013 20:09:15;0040;PBS_Server;Req;do_rpp;inter-server request
>>> received
>>> 11/13/2013 20:09:15;0004;PBS_Server;Svr;is_request;message received from
>>> stream 0 (version 1)
>>> 11/13/2013 20:09:15;0004;PBS_Server;Svr;is_request;message received from
>>> stream 72.34.135.64:15003
>>> 11/13/2013 20:09:15;0004;PBS_Server;Svr;is_request;message GPU_STATUS
>>> (5) received from mom on host node1 (72.34.135.64:15003) (stream 0)
>>> 11/13/2013 20:09:15;0004;PBS_Server;Svr;is_request;IS_GPU_STATUS
>>> received from node1
>>> 11/13/2013 20:09:15;0040;PBS_Server;Req;is_gpustat_get;received gpu
>>> status from node node1
>>> --
>>>
>>> On the client I do see that it is listening on port 15002:
>>>
>>> # netstat -an | grep 15002
>>> tcp        0      0 0.0.0.0:15002           0.0.0.0:*               LISTEN
>>>
>>> There is no firewall configured on these servers.
>>>
>>> What am I missing?
>>>
>>> Thanks,
>>> -J
>>>
>>>
>>> On Wed, Nov 13, 2013 at 7:34 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>>
>>>> Also, if I run "momctl -h node1 -d 2" I get a valid output but if I add
>>>> the port I get the following error:
>>>>
>>>> momctl -p 15002 -h node1 -d 2
>>>> ERROR:    query[0] 'diag2' failed on node1 (errno=0-Success:
>>>> 5-Input/output error)
>>>>
>>>> Any help would be appreciated!
>>>>
>>>> Thanks,
>>>> -J
>>>>
>>>>
>>>> On Wed, Nov 13, 2013 at 7:30 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>>>
>>>>> I am also seeing the following messages on the client (mom):
>>>>>
>>>>> pbs_mom;Svr;pbs_mom;LOG_ERROR::Cannot assign requested address (99) in
>>>>> post_epilogue,
>>>>> pbs_mom;Svr;pbs_mom;LOG_ERROR::Cannot assign requested address (99) in
>>>>> post_epilogue,
>>>>>
>>>>> Could this be related?
>>>>>
>>>>> Thanks,
>>>>> -J
>>>>>
>>>>>
>>>>> On Wed, Nov 13, 2013 at 7:09 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>>>>
>>>>>> The momctl command output looks normal:
>>>>>>
>>>>>> Host: node1/node1.gene.com   Version: 2.5.13   PID: 20707
>>>>>> Server[0]: server1 (10.36.244.247:15001)
>>>>>>   Init Msgs Received:     0 hellos/1 cluster-addrs
>>>>>>   Init Msgs Sent:         1 hellos
>>>>>>   Last Msg From Server:   70 seconds (StatusJob)
>>>>>>   Last Msg To Server:     14 seconds
>>>>>> HomeDirectory:          /var/spool/torque/mom_priv
>>>>>> stdout/stderr spool directory: '/var/spool/torque/spool/' (14933077
>>>>>> blocks available)
>>>>>> MOM active:             960 seconds
>>>>>> Check Poll Time:        45 seconds
>>>>>> Server Update Interval: 45 seconds
>>>>>> LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
>>>>>> Communication Model:    RPP
>>>>>> MemLocked:              TRUE  (mlock)
>>>>>> TCP Timeout:            20 seconds
>>>>>> Prolog:                 /var/spool/torque/mom_priv/prologue (disabled)
>>>>>> Alarm Time:             0 of 10 seconds
>>>>>> Trusted Client List:    10.36.244.247,72.34.135.64,127.0.0.1
>>>>>> Copy Command:           /usr/bin/scp -rpB
>>>>>> job[7264.server1.gene.com]  state=RUNNING  sidlist=19320
>>>>>> job[7265.server1.gene.com]  state=RUNNING  sidlist=19795
>>>>>> job[7266.server1.gene.com]  state=RUNNING  sidlist=20117
>>>>>> Assigned CPU Count:     3
>>>>>>
>>>>>> diagnostics complete
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 13, 2013 at 4:52 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>>>>>
>>>>>>> It seems to be intermittent, and when the job does not run I don't
>>>>>>> see anything in the mom logs.  The other thing to point out is that
>>>>>>> this compute node is also part of another torque server but has been
>>>>>>> set to offline/down mode in the production instance.  Would that have
>>>>>>> any impact on this?
>>>>>>>
>>>>>>> Also, I don't have the momctl command on the compute node; it only
>>>>>>> exists on the server.  How can I check communication between the node
>>>>>>> and server from a torque perspective?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -J
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 13, 2013 at 4:45 PM, Matt Britt <msbritt at umich.edu> wrote:
>>>>>>>
>>>>>>>> I would look at the pbs_mom log at the corresponding time the job
>>>>>>>> was being run (16:31:01) as well as run momctl -d1 (or higher) on
>>>>>>>> the compute host to make sure you have two-way communication.
>>>>>>>>
>>>>>>>>  - Matt
>>>>>>>>
>>>>>>>>
>>>>>>>> --------------------------------------------
>>>>>>>> Matthew Britt
>>>>>>>> CAEN HPC Group - College of Engineering
>>>>>>>> msbritt at umich.edu
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Nov 13, 2013 at 7:37 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hey Guys,
>>>>>>>>>
>>>>>>>>> I am having some issues with a test torque deployment which only
>>>>>>>>> has 1 server and 1 compute node.  I am trying to submit an
>>>>>>>>> interactive job; the very first time it works, but on every
>>>>>>>>> subsequent attempt I get a Reject reply code=15043 and the job just
>>>>>>>>> stays queued, though sometimes it will end up running and give me a
>>>>>>>>> prompt.  I don't see any network issues, and at the OS level
>>>>>>>>> communication between the server and compute node seems fine.  What
>>>>>>>>> am I missing here and what can I check to troubleshoot this further?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> server_logs:
>>>>>>>>> ..
>>>>>>>>> 11/13/2013 16:30:35;0100;PBS_Server;Job;7221.server1.xxx.com;enqueuing
>>>>>>>>> into batch, state 1 hop 1
>>>>>>>>>  11/13/2013 16:30:35;0008;PBS_Server;Job;7221.server1.xxx.com;Job
>>>>>>>>> Queued at request of user1 at server1.xxx.com, owner =
>>>>>>>>> user1 at server1.xxx.com, job name = STDIN, queue = batch
>>>>>>>>> 11/13/2013 16:30:35;0040;PBS_Server;Svr;server1.xxx.com;Scheduler
>>>>>>>>> was sent the command new
>>>>>>>>> 11/13/2013 16:30:35;0008;PBS_Server;Job;7221.server1.xxx.com;Job
>>>>>>>>> Modified at request of Scheduler at server1.xxx.com
>>>>>>>>> 11/13/2013 16:30:35;0008;PBS_Server;Job;7221.server1.xxx.com;Job
>>>>>>>>> Run at request of Scheduler at server1.xxx.com
>>>>>>>>> 11/13/2013 16:30:36;0004;PBS_Server;Svr;WARNING;ALERT: unable to
>>>>>>>>> contact node node1
>>>>>>>>> 11/13/2013 16:30:36;0040;PBS_Server;Svr;server1.xxx.com;Scheduler
>>>>>>>>> was sent the command recyc
>>>>>>>>> 11/13/2013 16:31:01;0100;PBS_Server;Job;7222.server1.xxx.com;enqueuing
>>>>>>>>> into batch, state 1 hop 1
>>>>>>>>> 11/13/2013 16:31:01;0008;PBS_Server;Job;7222.server1.xxx.com;Job
>>>>>>>>> Queued at request of user1 at server1.xxx.com, owner =
>>>>>>>>> user1 at server1.xxx.com, job name = STDIN, queue = batch
>>>>>>>>> 11/13/2013 16:31:01;0040;PBS_Server;Svr;server1.xxx.com;Scheduler
>>>>>>>>> was sent the command new
>>>>>>>>> 11/13/2013 16:31:01;0008;PBS_Server;Job;7222.server1.xxx.com;Job
>>>>>>>>> Modified at request of Scheduler at server1.xxx.com
>>>>>>>>> 11/13/2013 16:31:01;0008;PBS_Server;Job;7222.server1.xxx.com;Job
>>>>>>>>> Run at request of Scheduler at server1.xxx.com
>>>>>>>>> 11/13/2013 16:31:03;0008;PBS_Server;Job;7222.server1.xxx.com;unable
>>>>>>>>> to run job, MOM rejected/rc=2
>>>>>>>>> *11/13/2013 16:31:03;0080;PBS_Server;Req;req_reject;Reject reply
>>>>>>>>> code=15043(Execution server rejected request MSG=cannot send job to mom,
>>>>>>>>> state=PRERUN), aux=0, type=RunJob, from Scheduler at server1.xxx.com*
>>>>>>>>> 11/13/2013 16:31:03;0008;PBS_Server;Job;7222.server1.xxx.com;Job
>>>>>>>>> Modified at request of Scheduler at server1.xxx.com
>>>>>>>>> 11/13/2013 16:31:03;0040;PBS_Server;Svr;server1.xxx.com;Scheduler
>>>>>>>>> was sent the command recyc
>>>>>>>>> ..
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> -J
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> torqueusers mailing list
>>>>>>>>> torqueusers at supercluster.org
>>>>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> David Beer | Senior Software Engineer
> Adaptive Computing
>