[torqueusers] Torque behaving badly

Jagga Soorma jagga13 at gmail.com
Wed Nov 13 17:52:33 MST 2013


The problem seems to be intermittent, and when the job does not run I don't
see anything in the mom logs.  The other thing to point out is that this
compute node is also part of another torque server, but has been set to
offline/down mode in the production instance.  Would that have any impact
on this?

Also, I don't have the momctl command on the compute node; it only exists
on the server.  How can I check communication between the node and server
from a torque perspective?  Again, the failure is intermittent.
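
Without momctl on the node, one rough substitute is a plain TCP reachability
test in each direction. The sketch below is only that: it assumes the stock
TORQUE ports (15001 for pbs_server, 15002 for pbs_mom), which a site may have
changed, and the helper name port_open is made up for illustration. A
successful connect shows the port is reachable, not that the TORQUE daemons
accept each other's requests, so it is a weaker check than momctl -d.

```python
import socket

# Assumed stock TORQUE ports; a site can override these, so verify locally.
PBS_SERVER_PORT = 15001  # pbs_server listens here for client and mom traffic
PBS_MOM_PORT = 15002     # pbs_mom listens here for requests from the server


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it tests raw TCP reachability only and does not
    speak the TORQUE protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example calls, using the hostnames that appear in the logs of this thread:
#   on the server:        port_open("node1", PBS_MOM_PORT)
#   on the compute node:  port_open("server1.xxx.com", PBS_SERVER_PORT)
```

Since the failure is intermittent, repeating both checks in a loop around the
time a job is submitted is more telling than a single run.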

Thanks,
-J


On Wed, Nov 13, 2013 at 4:45 PM, Matt Britt <msbritt at umich.edu> wrote:

> I would look at the pbs_mom log at the corresponding time the job was
> being run (16:31:01) as well as run momctl -d1 (or higher) on the compute
> host to make sure you have two-way communication.
>
>  - Matt
>
>
> --------------------------------------------
> Matthew Britt
> CAEN HPC Group - College of Engineering
> msbritt at umich.edu
>
>
>
> On Wed, Nov 13, 2013 at 7:37 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>
>> Hey Guys,
>>
>> I am having some issues with a test torque deployment which only has 1
>> server and 1 compute node.  I am trying to submit an interactive job, and
>> the very first time it works, but on every subsequent attempt I get a Reject
>> reply code=15043 and the job just stays queued; sometimes it will eventually
>> run and give me a prompt.  I don't see any network issues, and from the OS
>> communication between the server and compute node seems fine.  What am I
>> missing here and what can I check to troubleshoot this further?
>>
>> --
>> server_logs:
>> ..
>> 11/13/2013 16:30:35;0100;PBS_Server;Job;7221.server1.xxx.com;enqueuing
>> into batch, state 1 hop 1
>> 11/13/2013 16:30:35;0008;PBS_Server;Job;7221.server1.xxx.com;Job Queued
>> at request of user1 at server1.xxx.com, owner = user1 at server1.xxx.com, job
>> name = STDIN, queue = batch
>> 11/13/2013 16:30:35;0040;PBS_Server;Svr;server1.xxx.com;Scheduler was
>> sent the command new
>> 11/13/2013 16:30:35;0008;PBS_Server;Job;7221.server1.xxx.com;Job
>> Modified at request of Scheduler at server1.xxx.com
>> 11/13/2013 16:30:35;0008;PBS_Server;Job;7221.server1.xxx.com;Job Run at
>> request of Scheduler at server1.xxx.com
>> 11/13/2013 16:30:36;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact
>> node node1
>> 11/13/2013 16:30:36;0040;PBS_Server;Svr;server1.xxx.com;Scheduler was
>> sent the command recyc
>> 11/13/2013 16:31:01;0100;PBS_Server;Job;7222.server1.xxx.com;enqueuing
>> into batch, state 1 hop 1
>> 11/13/2013 16:31:01;0008;PBS_Server;Job;7222.server1.xxx.com;Job Queued
>> at request of user1 at server1.xxx.com, owner = user1 at server1.xxx.com, job
>> name = STDIN, queue = batch
>> 11/13/2013 16:31:01;0040;PBS_Server;Svr;server1.xxx.com;Scheduler was
>> sent the command new
>> 11/13/2013 16:31:01;0008;PBS_Server;Job;7222.server1.xxx.com;Job
>> Modified at request of Scheduler at server1.xxx.com
>> 11/13/2013 16:31:01;0008;PBS_Server;Job;7222.server1.xxx.com;Job Run at
>> request of Scheduler at server1.xxx.com
>> 11/13/2013 16:31:03;0008;PBS_Server;Job;7222.server1.xxx.com;unable to
>> run job, MOM rejected/rc=2
>> *11/13/2013 16:31:03;0080;PBS_Server;Req;req_reject;Reject reply
>> code=15043(Execution server rejected request MSG=cannot send job to mom,
>> state=PRERUN), aux=0, type=RunJob, from Scheduler at server1.xxx.com*
>> 11/13/2013 16:31:03;0008;PBS_Server;Job;7222.server1.xxx.com;Job
>> Modified at request of Scheduler at server1.xxx.com
>> 11/13/2013 16:31:03;0040;PBS_Server;Svr;server1.xxx.com;Scheduler was
>> sent the command recyc
>> ..
>>
>>
>> Thanks,
>> -J
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>>