[torqueusers] Torque behaving badly

Matt Britt msbritt at umich.edu
Wed Nov 13 17:45:24 MST 2013


I would look at the pbs_mom log on the compute host around the time the job
was run (16:31:01), and also run momctl -d1 (or a higher diagnostic level)
on the compute host to make sure the server and the MOM have two-way
communication.
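
If the log is large, a small filter can help pull out just the records around
that timestamp. The following is only a rough sketch: it assumes the pbs_mom
log uses the same semicolon-separated, timestamp-first record format as the
server_logs excerpt quoted below, and the names lines_near and window_seconds
are made up for this illustration. The sample records are copied from the
quoted server log, not from a real pbs_mom log.

```python
from datetime import datetime, timedelta

STAMP_FORMAT = "%m/%d/%Y %H:%M:%S"  # e.g. 11/13/2013 16:31:01


def lines_near(log_lines, when, window_seconds=30):
    """Return the records whose timestamp is within window_seconds of `when`."""
    target = datetime.strptime(when, STAMP_FORMAT)
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in log_lines:
        # The timestamp is everything before the first semicolon.
        stamp = line.split(";", 1)[0].strip()
        try:
            ts = datetime.strptime(stamp, STAMP_FORMAT)
        except ValueError:
            continue  # not a record start (wrapped line, blank line, etc.)
        if abs(ts - target) <= window:
            hits.append(line.rstrip("\n"))
    return hits


# Records taken from the server log in the original message.
sample = [
    "11/13/2013 16:30:36;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact node node1",
    "11/13/2013 16:31:01;0008;PBS_Server;Job;7222.server1.xxx.com;Job Run at request of Scheduler at server1.xxx.com",
    "11/13/2013 16:31:03;0008;PBS_Server;Job;7222.server1.xxx.com;unable to run job, MOM rejected/rc=2",
]

for record in lines_near(sample, "11/13/2013 16:31:01", window_seconds=5):
    print(record)
```

On the compute host you would pass the lines of the actual pbs_mom log file
for that day instead of the sample list; where that file lives depends on how
Torque was installed.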

 - Matt


--------------------------------------------
Matthew Britt
CAEN HPC Group - College of Engineering
msbritt at umich.edu



On Wed, Nov 13, 2013 at 7:37 PM, Jagga Soorma <jagga13 at gmail.com> wrote:

> Hey Guys,
>
> I am having some issues with a test torque deployment which only has 1
> server and 1 compute node.  I am trying to submit an interactive job.  The
> very first time it works, but every subsequent time I get a Reject reply
> code=15043 and the job just stays queued, though sometimes it eventually
> runs and gives me a prompt.  I don't see any network issues, and from the
> OS, communication between the server and compute node seems fine.  What am
> I missing here and what can I check to troubleshoot this further?
>
> --
> server_logs:
> ..
> 11/13/2013 16:30:35;0100;PBS_Server;Job;7221.server1.xxx.com;enqueuing
> into batch, state 1 hop 1
> 11/13/2013 16:30:35;0008;PBS_Server;Job;7221.server1.xxx.com;Job Queued
> at request of user1 at server1.xxx.com, owner = user1 at server1.xxx.com, job
> name = STDIN, queue = batch
> 11/13/2013 16:30:35;0040;PBS_Server;Svr;server1.xxx.com;Scheduler was
> sent the command new
> 11/13/2013 16:30:35;0008;PBS_Server;Job;7221.server1.xxx.com;Job Modified
> at request of Scheduler at server1.xxx.com
> 11/13/2013 16:30:35;0008;PBS_Server;Job;7221.server1.xxx.com;Job Run at
> request of Scheduler at server1.xxx.com
> 11/13/2013 16:30:36;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact
> node node1
> 11/13/2013 16:30:36;0040;PBS_Server;Svr;server1.xxx.com;Scheduler was
> sent the command recyc
> 11/13/2013 16:31:01;0100;PBS_Server;Job;7222.server1.xxx.com;enqueuing
> into batch, state 1 hop 1
> 11/13/2013 16:31:01;0008;PBS_Server;Job;7222.server1.xxx.com;Job Queued
> at request of user1 at server1.xxx.com, owner = user1 at server1.xxx.com, job
> name = STDIN, queue = batch
> 11/13/2013 16:31:01;0040;PBS_Server;Svr;server1.xxx.com;Scheduler was
> sent the command new
> 11/13/2013 16:31:01;0008;PBS_Server;Job;7222.server1.xxx.com;Job Modified
> at request of Scheduler at server1.xxx.com
> 11/13/2013 16:31:01;0008;PBS_Server;Job;7222.server1.xxx.com;Job Run at
> request of Scheduler at server1.xxx.com
> 11/13/2013 16:31:03;0008;PBS_Server;Job;7222.server1.xxx.com;unable to
> run job, MOM rejected/rc=2
> *11/13/2013 16:31:03;0080;PBS_Server;Req;req_reject;Reject reply
> code=15043(Execution server rejected request MSG=cannot send job to mom,
> state=PRERUN), aux=0, type=RunJob, from Scheduler at server1.xxx.com*
> 11/13/2013 16:31:03;0008;PBS_Server;Job;7222.server1.xxx.com;Job Modified
> at request of Scheduler at server1.xxx.com
> 11/13/2013 16:31:03;0040;PBS_Server;Svr;server1.xxx.com;Scheduler was
> sent the command recyc
> ..
>
>
> Thanks,
> -J