[torqueusers] Re: Torque behaving badly

Jagga Soorma jagga13 at gmail.com
Thu Nov 14 20:21:08 MST 2013


Yeah, I checked, and networking looks clean.  IPv6 was enabled and
InfiniBand interfaces were set up on the client, but I disabled/downed all
of those to make sure I had a simple setup with just one interface and one
route.  This client was working just fine in our 2.5.9 environment and
started having issues in this 2.5.13 test environment, which has been set
up with GPU support.

No duplicate IPs, and MTUs aren't an issue.  Could this be a bug in the
Torque version I am using?  Is there any configuration I should be
checking, or should I concentrate on the node itself?
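Since the failures are intermittent, one quick way to rule the TCP path in or out is to probe the MOM port from the pbs_server host in a loop and watch for sporadic failures. This is a generic sketch, not a Torque tool; `node1` and 15002 (pbs_mom's default service port) are the values from this thread, and it assumes bash with `/dev/tcp` support:

```shell
#!/usr/bin/env bash
# Probe a TCP port; intermittent "closed" results while pbs_mom is up
# would point at the network rather than at Torque itself.
probe_port() {
  local host=$1 port=$2
  # bash's built-in /dev/tcp avoids depending on nc/telnet being installed
  if timeout 3 bash -c ">/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# Demo against a local port with no listener, so the failure path is
# exercised deterministically:
probe_port 127.0.0.1 1
```

On the real setup you would leave something like `while sleep 5; do probe_port node1 15002; done` running while jobs are being submitted.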

Thanks,
-J


On Thu, Nov 14, 2013 at 6:42 PM, Stephen Cousins <steve.cousins at maine.edu> wrote:

> I'd probably check to make sure that networking was all clean. Any errors
> on the switches? Multiple routes? Duplex mismatch? Duplicate IP's? Maybe
> packets are getting lost sometimes? I'd start with the basics. Just a
> thought.
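For the "multiple routes" item above, one concrete check is to count default routes on the client; more than one on a box that is supposed to be single-homed is a red flag. A minimal sketch with sample `ip route` output inlined so it runs anywhere; on the real node you would pipe `ip route show` into the awk instead (the addresses below are illustrative):

```shell
# Count default routes; more than one on a node meant to have a single
# interface/route suggests leftover InfiniBand/IPv6 configuration.
# Sample routing-table output is inlined; on a live system use:
#   ip route show | awk '$1 == "default" { n++ } END { print n+0 }'
awk '$1 == "default" { n++ } END { print n+0 }' <<'EOF'
default via 72.34.135.1 dev eth0
72.34.135.0/24 dev eth0  proto kernel  scope link  src 72.34.135.64
EOF
```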
>
>
> On Thu, Nov 14, 2013 at 7:48 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>
>> Anyone?  Any ideas?  I am pulling my hair out on this one and can't seem
>> to find any issues with the server or client.  Any help would be greatly
>> appreciated!
>>
>> Thanks,
>> -J
>>
>>
>> On Wed, Nov 13, 2013 at 8:11 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>
>>> I have increased the log level on pbs_server and now I am seeing the
>>> following messages:
>>>
>>> --
>>> 11/13/2013 20:09:08;0004;PBS_Server;Svr;svr_connect;attempting connect
>>> to host 72.34.135.64 port 15002
>>> 11/13/2013 20:09:08;0004;PBS_Server;Svr;svr_connect;cannot connect to
>>> host port 15002 - cannot establish connection
>>> 11/13/2013 20:09:09;0004;PBS_Server;Svr;svr_connect;attempting connect
>>> to host 72.34.135.64 port 15002
>>> 11/13/2013 20:09:09;0004;PBS_Server;Svr;svr_connect;cannot connect to
>>> host port 15002 - cannot establish connection
>>> 11/13/2013 20:09:11;0004;PBS_Server;Svr;svr_connect;attempting connect
>>> to host 72.34.135.64 port 15002
>>> 11/13/2013 20:09:11;0004;PBS_Server;Svr;svr_connect;cannot connect to
>>> host port 15002 - cannot establish connection
>>> 11/13/2013 20:09:15;0040;PBS_Server;Req;do_rpp;rpp request received on
>>> stream 0
>>> 11/13/2013 20:09:15;0040;PBS_Server;Req;do_rpp;inter-server request
>>> received
>>> 11/13/2013 20:09:15;0004;PBS_Server;Svr;is_request;message received from
>>> stream 0 (version 1)
>>> 11/13/2013 20:09:15;0004;PBS_Server;Svr;is_request;message received from
>>> stream 72.34.135.64:15003
>>> 11/13/2013 20:09:15;0004;PBS_Server;Svr;is_request;message STATUS (4)
>>> received from mom on host node1 (72.34.135.64:15003) (stream 0)
>>> 11/13/2013 20:09:15;0004;PBS_Server;Svr;is_request;IS_STATUS received
>>> from node1
>>> 11/13/2013 20:09:15;0040;PBS_Server;Req;is_stat_get;received status from
>>> node node1
>>> 11/13/2013 20:09:15;0040;PBS_Server;Req;update_node_state;adjusting
>>> state for node node1 - state=0, newstate=0
>>> 11/13/2013 20:09:15;0040;PBS_Server;Req;do_rpp;rpp request received on
>>> stream 0
>>> 11/13/2013 20:09:15;0040;PBS_Server;Req;do_rpp;inter-server request
>>> received
>>> 11/13/2013 20:09:15;0004;PBS_Server;Svr;is_request;message received from
>>> stream 0 (version 1)
>>> 11/13/2013 20:09:15;0004;PBS_Server;Svr;is_request;message received from
>>> stream 72.34.135.64:15003
>>> 11/13/2013 20:09:15;0004;PBS_Server;Svr;is_request;message GPU_STATUS
>>> (5) received from mom on host node1 (72.34.135.64:15003) (stream 0)
>>> 11/13/2013 20:09:15;0004;PBS_Server;Svr;is_request;IS_GPU_STATUS
>>> received from node1
>>> 11/13/2013 20:09:15;0040;PBS_Server;Req;is_gpustat_get;received gpu
>>> status from node node1
>>> --
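One way to see how often (and when) those svr_connect failures happen is to tally them per timestamp from the server log. A small sketch; the sample lines are inlined so it is self-contained, and on a real install you would point the awk at the day's log under the server's log directory (the exact path varies by install, so that part is an assumption):

```shell
# Tally "cannot connect" events per second from pbs_server log lines.
# Sample data is inlined; aim the awk at your actual server log instead.
cat > /tmp/pbs_server.sample <<'EOF'
11/13/2013 20:09:08;0004;PBS_Server;Svr;svr_connect;cannot connect to host port 15002 - cannot establish connection
11/13/2013 20:09:09;0004;PBS_Server;Svr;svr_connect;cannot connect to host port 15002 - cannot establish connection
11/13/2013 20:09:11;0004;PBS_Server;Svr;svr_connect;cannot connect to host port 15002 - cannot establish connection
EOF
# Field 1 (splitting on ';') is the "MM/DD/YYYY HH:MM:SS" timestamp.
awk -F';' '/cannot connect to host port 15002/ { n[$1]++ }
           END { for (t in n) print t, n[t] }' /tmp/pbs_server.sample | sort
```

A burst of failures clustered around job starts, followed by normal status traffic, would match the intermittent behaviour described here.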
>>>
>>> On the client I do see that it is listening on port 15002:
>>>
>>> # netstat -an | grep 15002
>>> tcp        0      0 0.0.0.0:15002           0.0.0.0:*               LISTEN
>>>
>>> There is no firewall configured on these servers.
>>>
>>> What am I missing?
>>>
>>> Thanks,
>>> -J
>>>
>>>
>>> On Wed, Nov 13, 2013 at 7:34 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>>
>>>> Also, if I run "momctl -h node1 -d 2" I get a valid output but if I add
>>>> the port I get the following error:
>>>>
>>>> momctl -p 15002 -h node1 -d 2
>>>> ERROR:    query[0] 'diag2' failed on node1 (errno=0-Success:
>>>> 5-Input/output error)
>>>>
>>>> Any help would be appreciated!
>>>>
>>>> Thanks,
>>>> -J
>>>>
>>>>
>>>> On Wed, Nov 13, 2013 at 7:30 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>>>
>>>>> I am also seeing the following messages on the client (mom):
>>>>>
>>>>> pbs_mom;Svr;pbs_mom;LOG_ERROR::Cannot assign requested address (99) in
>>>>> post_epilogue,
>>>>> pbs_mom;Svr;pbs_mom;LOG_ERROR::Cannot assign requested address (99) in
>>>>> post_epilogue,
>>>>>
>>>>> Could this be related?
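For what it's worth, errno 99 on Linux is EADDRNOTAVAIL: the process tried to use a local address that no interface currently owns, which seems worth a second look given the IPv6/InfiniBand interfaces that were downed on this client. A generic way to translate a numeric errno on the box itself (this assumes Python is installed; it is not a Torque command):

```shell
# Translate numeric errno 99 into its symbolic name and message.
python3 -c 'import errno, os; print(errno.errorcode[99], "-", os.strerror(99))'
```

which prints `EADDRNOTAVAIL - Cannot assign requested address`, matching the pbs_mom log text above.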
>>>>>
>>>>> Thanks,
>>>>> -J
>>>>>
>>>>>
>>>>> On Wed, Nov 13, 2013 at 7:09 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>>>>
>>>>>> The momctl command output looks normal:
>>>>>>
>>>>>> Host: node1/node1.gene.com   Version: 2.5.13   PID: 20707
>>>>>> Server[0]: server1 (10.36.244.247:15001)
>>>>>>   Init Msgs Received:     0 hellos/1 cluster-addrs
>>>>>>   Init Msgs Sent:         1 hellos
>>>>>>   Last Msg From Server:   70 seconds (StatusJob)
>>>>>>   Last Msg To Server:     14 seconds
>>>>>> HomeDirectory:          /var/spool/torque/mom_priv
>>>>>> stdout/stderr spool directory: '/var/spool/torque/spool/' (14933077
>>>>>> blocks available)
>>>>>> MOM active:             960 seconds
>>>>>> Check Poll Time:        45 seconds
>>>>>> Server Update Interval: 45 seconds
>>>>>> LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
>>>>>> Communication Model:    RPP
>>>>>> MemLocked:              TRUE  (mlock)
>>>>>> TCP Timeout:            20 seconds
>>>>>> Prolog:                 /var/spool/torque/mom_priv/prologue (disabled)
>>>>>> Alarm Time:             0 of 10 seconds
>>>>>> Trusted Client List:    10.36.244.247,72.34.135.64,127.0.0.1
>>>>>> Copy Command:           /usr/bin/scp -rpB
>>>>>> job[7264.server1.gene.com]  state=RUNNING  sidlist=19320
>>>>>> job[7265.server1.gene.com]  state=RUNNING  sidlist=19795
>>>>>> job[7266.server1.gene.com]  state=RUNNING  sidlist=20117
>>>>>> Assigned CPU Count:     3
>>>>>>
>>>>>> diagnostics complete
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 13, 2013 at 4:52 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>>>>>
>>>>>>> It seems to be intermittent and when the job does not run then I
>>>>>>> don't see anything in the mom logs.  The other thing to point out is that
>>>>>>> this compute node is part of another Torque server but has been set to
>>>>>>> offline/down mode in the production instance.  Would that have any impact
>>>>>>> on this?
>>>>>>>
>>>>>>> Also, I don't have the momctl command on the compute node; it only
>>>>>>> exists on the server.  How can I check communication between the node and
>>>>>>> the server from a Torque perspective?  It seems to be intermittent.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -J
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 13, 2013 at 4:45 PM, Matt Britt <msbritt at umich.edu> wrote:
>>>>>>>
>>>>>>>> I would look at the pbs_mom log at the corresponding time the job
>>>>>>>> was being run (16:31:01) as well as run momctl -d1 (or higher) on
>>>>>>>> the compute host to make sure you have two-way communication.
>>>>>>>>
>>>>>>>>  - Matt
>>>>>>>>
>>>>>>>>
>>>>>>>> --------------------------------------------
>>>>>>>> Matthew Britt
>>>>>>>> CAEN HPC Group - College of Engineering
>>>>>>>> msbritt at umich.edu
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Nov 13, 2013 at 7:37 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hey Guys,
>>>>>>>>>
>>>>>>>>> I am having some issues with a test Torque deployment which only
>>>>>>>>> has 1 server and 1 compute node.  I am trying to submit an interactive
>>>>>>>>> job, and the very first time it works, but every subsequent time I get a
>>>>>>>>> Reject reply code=15043 and the job just stays queued, sometimes
>>>>>>>>> eventually running and giving me a prompt.  I don't see any network
>>>>>>>>> issues, and from the OS, communication between the server and compute
>>>>>>>>> node seems fine.  What am I missing here, and what can I check to
>>>>>>>>> troubleshoot this further?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> server_logs:
>>>>>>>>> ..
>>>>>>>>> 11/13/2013 16:30:35;0100;PBS_Server;Job;7221.server1.xxx.com;enqueuing
>>>>>>>>> into batch, state 1 hop 1
>>>>>>>>>  11/13/2013 16:30:35;0008;PBS_Server;Job;7221.server1.xxx.com;Job
>>>>>>>>> Queued at request of user1 at server1.xxx.com, owner =
>>>>>>>>> user1 at server1.xxx.com, job name = STDIN, queue = batch
>>>>>>>>> 11/13/2013 16:30:35;0040;PBS_Server;Svr;server1.xxx.com;Scheduler
>>>>>>>>> was sent the command new
>>>>>>>>> 11/13/2013 16:30:35;0008;PBS_Server;Job;7221.server1.xxx.com;Job
>>>>>>>>> Modified at request of Scheduler at server1.xxx.com
>>>>>>>>> 11/13/2013 16:30:35;0008;PBS_Server;Job;7221.server1.xxx.com;Job
>>>>>>>>> Run at request of Scheduler at server1.xxx.com
>>>>>>>>> 11/13/2013 16:30:36;0004;PBS_Server;Svr;WARNING;ALERT: unable to
>>>>>>>>> contact node node1
>>>>>>>>> 11/13/2013 16:30:36;0040;PBS_Server;Svr;server1.xxx.com;Scheduler
>>>>>>>>> was sent the command recyc
>>>>>>>>> 11/13/2013 16:31:01;0100;PBS_Server;Job;7222.server1.xxx.com;enqueuing
>>>>>>>>> into batch, state 1 hop 1
>>>>>>>>> 11/13/2013 16:31:01;0008;PBS_Server;Job;7222.server1.xxx.com;Job
>>>>>>>>> Queued at request of user1 at server1.xxx.com, owner =
>>>>>>>>> user1 at server1.xxx.com, job name = STDIN, queue = batch
>>>>>>>>> 11/13/2013 16:31:01;0040;PBS_Server;Svr;server1.xxx.com;Scheduler
>>>>>>>>> was sent the command new
>>>>>>>>> 11/13/2013 16:31:01;0008;PBS_Server;Job;7222.server1.xxx.com;Job
>>>>>>>>> Modified at request of Scheduler at server1.xxx.com
>>>>>>>>> 11/13/2013 16:31:01;0008;PBS_Server;Job;7222.server1.xxx.com;Job
>>>>>>>>> Run at request of Scheduler at server1.xxx.com
>>>>>>>>> 11/13/2013 16:31:03;0008;PBS_Server;Job;7222.server1.xxx.com;unable
>>>>>>>>> to run job, MOM rejected/rc=2
>>>>>>>>> *11/13/2013 16:31:03;0080;PBS_Server;Req;req_reject;Reject reply
>>>>>>>>> code=15043(Execution server rejected request MSG=cannot send job to mom,
>>>>>>>>> state=PRERUN), aux=0, type=RunJob, from Scheduler at server1.xxx.com
>>>>>>>>> <Scheduler at server1.xxx.com>*
>>>>>>>>> 11/13/2013 16:31:03;0008;PBS_Server;Job;7222.server1.xxx.com;Job
>>>>>>>>> Modified at request of Scheduler at server1.xxx.com
>>>>>>>>> 11/13/2013 16:31:03;0040;PBS_Server;Svr;server1.xxx.com;Scheduler
>>>>>>>>> was sent the command recyc
>>>>>>>>> ..
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> -J
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> torqueusers mailing list
>>>>>>>>> torqueusers at supercluster.org
>>>>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>
>
>
>
>
>