[torqueusers] Torque behaving badly

Jagga Soorma jagga13 at gmail.com
Wed Nov 13 17:37:30 MST 2013


Hey Guys,

I am having some issues with a test Torque deployment that has only 1
server and 1 compute node.  I am trying to submit an interactive job.  The
very first time it works, but on every subsequent attempt I get a Reject
reply code=15043 and the job just stays queued; sometimes it eventually
runs and gives me a prompt.  I don't see any network issues, and at the OS
level communication between the server and compute node seems fine.  What
am I missing here, and what can I check to troubleshoot this further?

--
server_logs:
..
11/13/2013 16:30:35;0100;PBS_Server;Job;7221.server1.xxx.com;enqueuing into
batch, state 1 hop 1
11/13/2013 16:30:35;0008;PBS_Server;Job;7221.server1.xxx.com;Job Queued at
request of user1 at server1.xxx.com, owner = user1 at server1.xxx.com, job name =
STDIN, queue = batch
11/13/2013 16:30:35;0040;PBS_Server;Svr;server1.xxx.com;Scheduler was sent
the command new
11/13/2013 16:30:35;0008;PBS_Server;Job;7221.server1.xxx.com;Job Modified
at request of Scheduler at server1.xxx.com
11/13/2013 16:30:35;0008;PBS_Server;Job;7221.server1.xxx.com;Job Run at
request of Scheduler at server1.xxx.com
11/13/2013 16:30:36;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact
node node1
11/13/2013 16:30:36;0040;PBS_Server;Svr;server1.xxx.com;Scheduler was sent
the command recyc
11/13/2013 16:31:01;0100;PBS_Server;Job;7222.server1.xxx.com;enqueuing into
batch, state 1 hop 1
11/13/2013 16:31:01;0008;PBS_Server;Job;7222.server1.xxx.com;Job Queued at
request of user1 at server1.xxx.com, owner = user1 at server1.xxx.com, job name =
STDIN, queue = batch
11/13/2013 16:31:01;0040;PBS_Server;Svr;server1.xxx.com;Scheduler was sent
the command new
11/13/2013 16:31:01;0008;PBS_Server;Job;7222.server1.xxx.com;Job Modified
at request of Scheduler at server1.xxx.com
11/13/2013 16:31:01;0008;PBS_Server;Job;7222.server1.xxx.com;Job Run at
request of Scheduler at server1.xxx.com
11/13/2013 16:31:03;0008;PBS_Server;Job;7222.server1.xxx.com;unable to run
job, MOM rejected/rc=2
*11/13/2013 16:31:03;0080;PBS_Server;Req;req_reject;Reject reply
code=15043(Execution server rejected request MSG=cannot send job to mom,
state=PRERUN), aux=0, type=RunJob, from Scheduler at server1.xxx.com
<Scheduler at server1.xxx.com>*
11/13/2013 16:31:03;0008;PBS_Server;Job;7222.server1.xxx.com;Job Modified
at request of Scheduler at server1.xxx.com
11/13/2013 16:31:03;0040;PBS_Server;Svr;server1.xxx.com;Scheduler was sent
the command recyc
..
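
For anyone who wants to sift a longer copy of this log, here is a rough
Python sketch that pulls out only the entries that look like failures.  It
assumes each entry has the six semicolon-separated fields seen in the
excerpt above (timestamp, event code, daemon, class, object, message) and
that long entries were wrapped by the mail client.  The function names and
the keyword list are my own choices for illustration, not anything that
ships with Torque.

```python
import re

# Assumed layout, inferred only from the excerpt above:
#   date time;event code;daemon;class;object;message
# A leading "*" is tolerated because the mail client marked one entry in bold.
ENTRY_START = re.compile(r"^\*?\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")


def join_wrapped(lines):
    """Re-join log entries that were wrapped across several lines."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line == "..":
            continue  # skip blanks and the ".." elision markers
        if ENTRY_START.match(line):
            entries.append(line)
        elif entries:
            entries[-1] += " " + line  # continuation of the previous entry
    return entries


def parse_entry(entry):
    """Split one entry into its six fields; the message may contain ';'."""
    stamp, code, daemon, cls, obj, msg = entry.strip("*").split(";", 5)
    return {
        "time": stamp,
        "code": code,
        "daemon": daemon,
        "class": cls,
        "object": obj,
        "message": msg,
    }


def problems(lines, keywords=("unable to", "Reject reply")):
    """Return parsed entries whose message contains any of the keywords."""
    parsed = (parse_entry(e) for e in join_wrapped(lines))
    return [e for e in parsed if any(k in e["message"] for k in keywords)]
```

Feeding it the lines of the excerpt (for example via
problems(open(path).read().splitlines()), where path is wherever the server
log was saved) should leave just the "unable to contact node", "unable to
run job" and "Reject reply" entries, which makes it easier to line up their
timestamps against the MOM log on the compute node.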


Thanks,
-J