[torqueusers] Torque behaving badly

David Beer dbeer at adaptivecomputing.com
Thu Nov 14 19:40:45 MST 2013


From the momctl output you showed, your mom's log level is at 0. I would
change it to 10 and then look at what happens when the job is submitted.
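For example, something along these lines on the compute node should do it
(paths assume the default /var/spool/torque layout shown in your momctl
output; adjust if yours differs):

--
# persistent: add (or edit) a $loglevel line in the mom config, then restart pbs_mom
echo '$loglevel 10' >> /var/spool/torque/mom_priv/config
# restart pbs_mom with whatever your init system uses, e.g. "service pbs_mom restart"

# or, without a restart, raise the level one step per signal
# (your momctl output already notes "use SIGUSR1/SIGUSR2 to adjust")
kill -USR1 $(pidof pbs_mom)

# then watch the mom log while you resubmit the job
tail -f /var/spool/torque/mom_logs/$(date +%Y%m%d)
--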


On Thu, Nov 14, 2013 at 5:48 PM, Jagga Soorma <jagga13 at gmail.com> wrote:

> Anyone?  Any ideas?  I am pulling my hair out on this one and can't seem
> to find any issues with the server or client.  Any help would be greatly
> appreciated!
>
> Thanks,
> -J
>
>
> On Wed, Nov 13, 2013 at 8:11 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>
>> I have increased the log level on pbs_server and now I am seeing the
>> following messages:
>>
>> --
>>
>> 11/13/2013 20:09:08;0004;PBS_Server;Svr;svr_connect;attempting connect to host 72.34.135.64 port 15002
>> 11/13/2013 20:09:08;0004;PBS_Server;Svr;svr_connect;cannot connect to host port 15002 - cannot establish connection (
>> 11/13/2013 20:09:09;0004;PBS_Server;Svr;svr_connect;attempting connect to host 72.34.135.64 port 15002
>> 11/13/2013 20:09:09;0004;PBS_Server;Svr;svr_connect;cannot connect to host port 15002 - cannot establish connection (
>> 11/13/2013 20:09:11;0004;PBS_Server;Svr;svr_connect;attempting connect to host 72.34.135.64 port 15002
>> 11/13/2013 20:09:11;0004;PBS_Server;Svr;svr_connect;cannot connect to host port 15002 - cannot establish connection (
>> 11/13/2013 20:09:15;0040;PBS_Server;Req;do_rpp;rpp request received on
>> stream 0
>> 11/13/2013 20:09:15;0040;PBS_Server;Req;do_rpp;inter-server request
>> received
>> 11/13/2013 20:09:15;0004;PBS_Server;Svr;is_request;message received from
>> stream 0 (version 1)
>> 11/13/2013 20:09:15;0004;PBS_Server;Svr;is_request;message received from
>> stream 72.34.135.64:15003
>> 11/13/2013 20:09:15;0004;PBS_Server;Svr;is_request;message STATUS (4)
>> received from mom on host node1 (72.34.135.64:15003) (stream 0)
>> 11/13/2013 20:09:15;0004;PBS_Server;Svr;is_request;IS_STATUS received
>> from node1
>> 11/13/2013 20:09:15;0040;PBS_Server;Req;is_stat_get;received status from
>> node node1
>> 11/13/2013 20:09:15;0040;PBS_Server;Req;update_node_state;adjusting state
>> for node node1 - state=0, newstate=0
>> 11/13/2013 20:09:15;0040;PBS_Server;Req;do_rpp;rpp request received on
>> stream 0
>> 11/13/2013 20:09:15;0040;PBS_Server;Req;do_rpp;inter-server request
>> received
>> 11/13/2013 20:09:15;0004;PBS_Server;Svr;is_request;message received from
>> stream 0 (version 1)
>> 11/13/2013 20:09:15;0004;PBS_Server;Svr;is_request;message received from
>> stream 72.34.135.64:15003
>> 11/13/2013 20:09:15;0004;PBS_Server;Svr;is_request;message GPU_STATUS (5)
>> received from mom on host node1 (72.34.135.64:15003) (stream 0)
>> 11/13/2013 20:09:15;0004;PBS_Server;Svr;is_request;IS_GPU_STATUS received
>> from node1
>> 11/13/2013 20:09:15;0040;PBS_Server;Req;is_gpustat_get;received gpu
>> status from node node1
>> --
>>
>> On the client I do see that it is listening on port 15002:
>>
>> # netstat -an | grep 15002
>> tcp        0      0 0.0.0.0:15002           0.0.0.0:*
>> LISTEN
>>
>> There is no firewall configured on these servers.
>>
>> What am I missing?
>>
>> Thanks,
>> -J
>>
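A side note on the svr_connect failures above: netstat on the node shows
something listening on 0.0.0.0:15002, yet pbs_server still cannot complete a
TCP connection to 72.34.135.64:15002. A quick sanity check, run from the
pbs_server host rather than the node, is to attempt a raw TCP connection
yourself, assuming nc or telnet is installed:

--
# from the pbs_server host; either should connect if the network path is clear
nc -vz 72.34.135.64 15002
telnet 72.34.135.64 15002

# also confirm the server resolves node1 to the address you expect
getent hosts node1
--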
>>
>> On Wed, Nov 13, 2013 at 7:34 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>
>>> Also, if I run "momctl -h node1 -d 2" I get a valid output but if I add
>>> the port I get the following error:
>>>
>>> momctl -p 15002 -h node1 -d 2
>>> ERROR:    query[0] 'diag2' failed on node1 (errno=0-Success:
>>> 5-Input/output error)
>>>
>>> Any help would be appreciated!
>>>
>>> Thanks,
>>> -J
>>>
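On the momctl -p test above: I am not sure off the top of my head which port
momctl defaults to, but since the query works without -p and fails with
-p 15002, it is worth listing which ports pbs_mom actually holds open on the
node, for example:

--
# on the compute node: which TCP/UDP ports does pbs_mom hold open?
netstat -lntup 2>/dev/null | grep pbs_mom
# or, if lsof is available
lsof -i -P -n | grep pbs_mom
--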
>>>
>>> On Wed, Nov 13, 2013 at 7:30 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>>
>>>> I am also seeing the following messages on the client (mom):
>>>>
>>>> pbs_mom;Svr;pbs_mom;LOG_ERROR::Cannot assign requested address (99) in
>>>> post_epilogue,
>>>> pbs_mom;Svr;pbs_mom;LOG_ERROR::Cannot assign requested address (99) in
>>>> post_epilogue,
>>>>
>>>> Could this be related?
>>>>
>>>> Thanks,
>>>> -J
>>>>
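It could be related. Errno 99 ("Cannot assign requested address") usually
means a bind or outgoing connection used a local address the box does not
actually own, so it is worth comparing what the node's hostname resolves to
against the addresses configured on its interfaces, e.g.:

--
# on the compute node: does the hostname resolve to an address this box carries?
hostname -f
getent hosts $(hostname -f)
ip addr show | grep 'inet '

# and make sure /etc/hosts is not pointing the name at a stale address
grep -i node1 /etc/hosts
--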
>>>>
>>>> On Wed, Nov 13, 2013 at 7:09 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>>>
>>>>> The momctl command output looks normal:
>>>>>
>>>>> Host: node1/node1.gene.com   Version: 2.5.13   PID: 20707
>>>>> Server[0]: server1 (10.36.244.247:15001)
>>>>>   Init Msgs Received:     0 hellos/1 cluster-addrs
>>>>>   Init Msgs Sent:         1 hellos
>>>>>   Last Msg From Server:   70 seconds (StatusJob)
>>>>>   Last Msg To Server:     14 seconds
>>>>> HomeDirectory:          /var/spool/torque/mom_priv
>>>>> stdout/stderr spool directory: '/var/spool/torque/spool/' (14933077
>>>>> blocks available)
>>>>> MOM active:             960 seconds
>>>>> Check Poll Time:        45 seconds
>>>>> Server Update Interval: 45 seconds
>>>>> LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
>>>>> Communication Model:    RPP
>>>>> MemLocked:              TRUE  (mlock)
>>>>> TCP Timeout:            20 seconds
>>>>> Prolog:                 /var/spool/torque/mom_priv/prologue (disabled)
>>>>> Alarm Time:             0 of 10 seconds
>>>>> Trusted Client List:    10.36.244.247,72.34.135.64,127.0.0.1
>>>>> Copy Command:           /usr/bin/scp -rpB
>>>>> job[7264.server1.gene.com]  state=RUNNING  sidlist=19320
>>>>> job[7265.server1.gene.com]  state=RUNNING  sidlist=19795
>>>>> job[7266.server1.gene.com]  state=RUNNING  sidlist=20117
>>>>> Assigned CPU Count:     3
>>>>>
>>>>> diagnostics complete
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Nov 13, 2013 at 4:52 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>>>>
>>>>>> It seems to be intermittent, and when the job does not run I don't
>>>>>> see anything in the mom logs.  The other thing to point out is that
>>>>>> this compute node is part of another torque server, but it has been
>>>>>> set to offline/down mode in that production instance.  Would that have
>>>>>> any impact on this?
>>>>>>
>>>>>> Also, I don't have the momctl command on the compute node; it only
>>>>>> exists on the server.  How can I check communication between the node
>>>>>> and the server from a torque perspective?
>>>>>>
>>>>>> Thanks,
>>>>>> -J
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 13, 2013 at 4:45 PM, Matt Britt <msbritt at umich.edu> wrote:
>>>>>>
>>>>>>> I would look at the pbs_mom log at the time the job was being run
>>>>>>> (16:31:01), and also run momctl -d1 (or higher) on the compute host
>>>>>>> to make sure you have two-way communication.
>>>>>>>
>>>>>>>  - Matt
>>>>>>>
>>>>>>>
>>>>>>> --------------------------------------------
>>>>>>> Matthew Britt
>>>>>>> CAEN HPC Group - College of Engineering
>>>>>>> msbritt at umich.edu
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 13, 2013 at 7:37 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>>>>>>
>>>>>>>> Hey Guys,
>>>>>>>>
>>>>>>>> I am having some issues with a test torque deployment which only
>>>>>>>> has 1 server and 1 compute node.  I am trying to submit an
>>>>>>>> interactive job, and the very first time it works, but every
>>>>>>>> subsequent time I get a Reject reply code=15043 and the job just
>>>>>>>> stays queued, sometimes eventually running and giving me a prompt.
>>>>>>>> I don't see any network issues, and from the OS, communication
>>>>>>>> between the server and the compute node seems fine.  What am I
>>>>>>>> missing here and what can I check to troubleshoot this further?
>>>>>>>>
>>>>>>>> --
>>>>>>>> server_logs:
>>>>>>>> ..
>>>>>>>> 11/13/2013 16:30:35;0100;PBS_Server;Job;7221.server1.xxx.com;enqueuing
>>>>>>>> into batch, state 1 hop 1
>>>>>>>>  11/13/2013 16:30:35;0008;PBS_Server;Job;7221.server1.xxx.com;Job
>>>>>>>> Queued at request of user1 at server1.xxx.com, owner =
>>>>>>>> user1 at server1.xxx.com, job name = STDIN, queue = batch
>>>>>>>> 11/13/2013 16:30:35;0040;PBS_Server;Svr;server1.xxx.com;Scheduler
>>>>>>>> was sent the command new
>>>>>>>> 11/13/2013 16:30:35;0008;PBS_Server;Job;7221.server1.xxx.com;Job
>>>>>>>> Modified at request of Scheduler at server1.xxx.com
>>>>>>>> 11/13/2013 16:30:35;0008;PBS_Server;Job;7221.server1.xxx.com;Job
>>>>>>>> Run at request of Scheduler at server1.xxx.com
>>>>>>>> 11/13/2013 16:30:36;0004;PBS_Server;Svr;WARNING;ALERT: unable to
>>>>>>>> contact node node1
>>>>>>>> 11/13/2013 16:30:36;0040;PBS_Server;Svr;server1.xxx.com;Scheduler
>>>>>>>> was sent the command recyc
>>>>>>>> 11/13/2013 16:31:01;0100;PBS_Server;Job;7222.server1.xxx.com;enqueuing
>>>>>>>> into batch, state 1 hop 1
>>>>>>>> 11/13/2013 16:31:01;0008;PBS_Server;Job;7222.server1.xxx.com;Job
>>>>>>>> Queued at request of user1 at server1.xxx.com, owner =
>>>>>>>> user1 at server1.xxx.com, job name = STDIN, queue = batch
>>>>>>>> 11/13/2013 16:31:01;0040;PBS_Server;Svr;server1.xxx.com;Scheduler
>>>>>>>> was sent the command new
>>>>>>>> 11/13/2013 16:31:01;0008;PBS_Server;Job;7222.server1.xxx.com;Job
>>>>>>>> Modified at request of Scheduler at server1.xxx.com
>>>>>>>> 11/13/2013 16:31:01;0008;PBS_Server;Job;7222.server1.xxx.com;Job
>>>>>>>> Run at request of Scheduler at server1.xxx.com
>>>>>>>> 11/13/2013 16:31:03;0008;PBS_Server;Job;7222.server1.xxx.com;unable
>>>>>>>> to run job, MOM rejected/rc=2
>>>>>>>> 11/13/2013 16:31:03;0080;PBS_Server;Req;req_reject;Reject reply
>>>>>>>> code=15043(Execution server rejected request MSG=cannot send job to mom,
>>>>>>>> state=PRERUN), aux=0, type=RunJob, from Scheduler at server1.xxx.com
>>>>>>>> 11/13/2013 16:31:03;0008;PBS_Server;Job;7222.server1.xxx.com;Job
>>>>>>>> Modified at request of Scheduler at server1.xxx.com
>>>>>>>> 11/13/2013 16:31:03;0040;PBS_Server;Svr;server1.xxx.com;Scheduler
>>>>>>>> was sent the command recyc
>>>>>>>> ..
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -J
>>>>>>>>
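For the original symptom, the intermittent "unable to contact node node1" and
MOM rejected/rc=2 messages, it can also help to watch the node's state from
the server side while re-submitting, and to double-check the node definition
the server is using (again assuming the default /var/spool/torque layout):

--
# from the pbs_server host: watch node state while you resubmit the interactive job
watch -n 2 pbsnodes -a

# and confirm the node list pbs_server was started with
cat /var/spool/torque/server_priv/nodes
--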
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
>


-- 
David Beer | Senior Software Engineer
Adaptive Computing