[torqueusers] Torque behaving badly

Jagga Soorma jagga13 at gmail.com
Wed Nov 13 20:34:06 MST 2013


Also, if I run "momctl -h node1 -d 2" I get valid output, but if I add the
port I get the following error:

momctl -p 15002 -h node1 -d 2
ERROR:    query[0] 'diag2' failed on node1 (errno=0-Success: 5-Input/output
error)
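[Editor's note: one possible explanation, offered as an assumption to verify against your build rather than a confirmed diagnosis: momctl connects to the MOM *manager* port, which defaults to 15003 in stock TORQUE 2.x builds, while 15002 is the MOM service port used for pbs_server traffic. Pointing -p at 15002 would then fail exactly like this even though the default (no -p) query works. A minimal sketch of the stock port map:]

```shell
# Stock TORQUE 2.x default TCP ports (assumption: no overrides in
# torque.cfg or on the daemons' command lines):
PBS_SERVER_PORT=15001   # pbs_server
MOM_SERVICE_PORT=15002  # pbs_mom <-> pbs_server traffic
MOM_MANAGER_PORT=15003  # the port momctl connects to by default
SCHEDULER_PORT=15004    # pbs_sched

# So an explicit-port diag query would be addressed to the manager port:
echo "momctl -p $MOM_MANAGER_PORT -h node1 -d 2"
```

If that form works where -p 15002 failed, the error was a port mismatch rather than a real communication fault.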

Any help would be appreciated!

Thanks,
-J


On Wed, Nov 13, 2013 at 7:30 PM, Jagga Soorma <jagga13 at gmail.com> wrote:

> I am also seeing the following messages on the client (mom):
>
> pbs_mom;Svr;pbs_mom;LOG_ERROR::Cannot assign requested address (99) in
> post_epilogue,
> pbs_mom;Svr;pbs_mom;LOG_ERROR::Cannot assign requested address (99) in
> post_epilogue,
>
> Could this be related?
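[Editor's note: errno 99 on Linux is EADDRNOTAVAIL, meaning a bind() or connect() was given a local address that is not actually configured on the host; it is worth checking the mom's /etc/hosts entries and interface addresses against what pbs_server expects. A quick way to confirm the errno name on the node (assuming python3 is installed):]

```shell
# Map errno 99 to its symbolic name and message (Linux):
python3 -c 'import errno, os; print(errno.errorcode[99], os.strerror(99))'
# prints: EADDRNOTAVAIL Cannot assign requested address
```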
>
> Thanks,
> -J
>
>
> On Wed, Nov 13, 2013 at 7:09 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>
>> The momctl command output looks normal:
>>
>> Host: node1/node1.gene.com   Version: 2.5.13   PID: 20707
>> Server[0]: server1 (10.36.244.247:15001)
>>   Init Msgs Received:     0 hellos/1 cluster-addrs
>>   Init Msgs Sent:         1 hellos
>>   Last Msg From Server:   70 seconds (StatusJob)
>>   Last Msg To Server:     14 seconds
>> HomeDirectory:          /var/spool/torque/mom_priv
>> stdout/stderr spool directory: '/var/spool/torque/spool/' (14933077
>> blocks available)
>> MOM active:             960 seconds
>> Check Poll Time:        45 seconds
>> Server Update Interval: 45 seconds
>> LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
>> Communication Model:    RPP
>> MemLocked:              TRUE  (mlock)
>> TCP Timeout:            20 seconds
>> Prolog:                 /var/spool/torque/mom_priv/prologue (disabled)
>> Alarm Time:             0 of 10 seconds
>> Trusted Client List:    10.36.244.247,72.34.135.64,127.0.0.1
>> Copy Command:           /usr/bin/scp -rpB
>> job[7264.server1.gene.com]  state=RUNNING  sidlist=19320
>> job[7265.server1.gene.com]  state=RUNNING  sidlist=19795
>> job[7266.server1.gene.com]  state=RUNNING  sidlist=20117
>> Assigned CPU Count:     3
>>
>> diagnostics complete
>>
>> On Wed, Nov 13, 2013 at 4:52 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>
>>> It seems to be intermittent, and when the job does not run I don't
>>> see anything in the mom logs.  One other thing to point out: this
>>> compute node is also part of another torque server, but it has been set
>>> to offline/down mode in that production instance.  Would that have any
>>> impact on this?
>>>
>>> Also, I don't have the momctl command on the compute node; it only
>>> exists on the server.  How can I check communication between the node
>>> and the server from a torque perspective?
>>>
>>> Thanks,
>>> -J
>>>
>>>
>>> On Wed, Nov 13, 2013 at 4:45 PM, Matt Britt <msbritt at umich.edu> wrote:
>>>
>>>> I would look at the pbs_mom log at the corresponding time the job was
>>>> being run (16:31:01) as well as run momctl -d1 (or higher) on the
>>>> compute host to make sure you have two-way communication.
>>>>
>>>>  - Matt
>>>>
>>>>
>>>> --------------------------------------------
>>>> Matthew Britt
>>>> CAEN HPC Group - College of Engineering
>>>> msbritt at umich.edu
>>>>
>>>>
>>>>
>>>> On Wed, Nov 13, 2013 at 7:37 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>>>
>>>>> Hey Guys,
>>>>>
>>>>> I am having some issues with a test torque deployment which has only 1
>>>>> server and 1 compute node.  I am trying to submit an interactive job; the
>>>>> very first time it works, but on every subsequent attempt I get a Reject
>>>>> reply code=15043 and the job just stays queued, sometimes eventually
>>>>> running and giving me a prompt.  I don't see any network issues, and at
>>>>> the OS level communication between the server and compute node seems
>>>>> fine.  What am I missing here, and what can I check to troubleshoot this
>>>>> further?
>>>>>
>>>>> --
>>>>> server_logs:
>>>>> ..
>>>>> 11/13/2013 16:30:35;0100;PBS_Server;Job;7221.server1.xxx.com;enqueuing
>>>>> into batch, state 1 hop 1
>>>>>  11/13/2013 16:30:35;0008;PBS_Server;Job;7221.server1.xxx.com;Job
>>>>> Queued at request of user1 at server1.xxx.com, owner =
>>>>> user1 at server1.xxx.com, job name = STDIN, queue = batch
>>>>> 11/13/2013 16:30:35;0040;PBS_Server;Svr;server1.xxx.com;Scheduler was
>>>>> sent the command new
>>>>> 11/13/2013 16:30:35;0008;PBS_Server;Job;7221.server1.xxx.com;Job
>>>>> Modified at request of Scheduler at server1.xxx.com
>>>>> 11/13/2013 16:30:35;0008;PBS_Server;Job;7221.server1.xxx.com;Job Run
>>>>> at request of Scheduler at server1.xxx.com
>>>>> 11/13/2013 16:30:36;0004;PBS_Server;Svr;WARNING;ALERT: unable to
>>>>> contact node node1
>>>>> 11/13/2013 16:30:36;0040;PBS_Server;Svr;server1.xxx.com;Scheduler was
>>>>> sent the command recyc
>>>>> 11/13/2013 16:31:01;0100;PBS_Server;Job;7222.server1.xxx.com;enqueuing
>>>>> into batch, state 1 hop 1
>>>>> 11/13/2013 16:31:01;0008;PBS_Server;Job;7222.server1.xxx.com;Job
>>>>> Queued at request of user1 at server1.xxx.com, owner =
>>>>> user1 at server1.xxx.com, job name = STDIN, queue = batch
>>>>> 11/13/2013 16:31:01;0040;PBS_Server;Svr;server1.xxx.com;Scheduler was
>>>>> sent the command new
>>>>> 11/13/2013 16:31:01;0008;PBS_Server;Job;7222.server1.xxx.com;Job
>>>>> Modified at request of Scheduler at server1.xxx.com
>>>>> 11/13/2013 16:31:01;0008;PBS_Server;Job;7222.server1.xxx.com;Job Run
>>>>> at request of Scheduler at server1.xxx.com
>>>>> 11/13/2013 16:31:03;0008;PBS_Server;Job;7222.server1.xxx.com;unable
>>>>> to run job, MOM rejected/rc=2
>>>>> *11/13/2013 16:31:03;0080;PBS_Server;Req;req_reject;Reject reply
>>>>> code=15043(Execution server rejected request MSG=cannot send job to mom,
>>>>> state=PRERUN), aux=0, type=RunJob, from Scheduler at server1.xxx.com
>>>>> <Scheduler at server1.xxx.com>*
>>>>> 11/13/2013 16:31:03;0008;PBS_Server;Job;7222.server1.xxx.com;Job
>>>>> Modified at request of Scheduler at server1.xxx.com
>>>>> 11/13/2013 16:31:03;0040;PBS_Server;Svr;server1.xxx.com;Scheduler was
>>>>> sent the command recyc
>>>>> ..
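[Editor's note: when sifting excerpts like the one above, it helps that pbs_server log records are semicolon-delimited fields: timestamp;event-code;daemon;object-type;object-name;message. A small awk sketch (the sample line is shortened from the log above) that pulls out just the event code and the message:]

```shell
# PBS server log fields: timestamp;code;daemon;obj-type;obj-name;message
line='11/13/2013 16:31:03;0080;PBS_Server;Req;req_reject;Reject reply code=15043'
echo "$line" | awk -F';' '{ print $2 ":", $NF }'
# prints: 0080: Reject reply code=15043
```

Filtering on the event-code field (e.g. 0080 for request rejections) is often faster than grepping the free-text message.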
>>>>>
>>>>>
>>>>> Thanks,
>>>>> -J
>>>>>
>>>>> _______________________________________________
>>>>> torqueusers mailing list
>>>>> torqueusers at supercluster.org
>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

