[torquedev] Problems starting job with torque 2.1.6

Åke Sandgren ake.sandgren at hpc2n.umu.se
Tue Jan 16 02:49:52 MST 2007


On Tue, 2007-01-16 at 10:40 +0100, Lennart Karlsson wrote:
> ake.sandgren at hpc2n.umu.se said:
> > I just had a semi-large job (90 nodes) fail to start because the master
> > node did not send JOIN_JOB to all sisters, or at least some sisters did
> > not receive it.
> >
> > Has anyone seen anything like this?
> 
> Yes, I am now seeing similar behaviour. Since last week a 128-node job
> has failed to start on a 199-node cluster, also on Torque 2.1.6.
> 
> Master node says:
> 01/16/2007 08:53:00;0008;   pbs_mom;Job;374675.moonwatch;Job Modified at 
> request of PBS_Server at s1
> 01/16/2007 08:56:09;0002;   pbs_mom;Svr;im_eof;Premature end of message from 
> addr 192.168.1.13:15003
> 01/16/2007 08:56:09;0001;   pbs_mom;Svr;pbs_mom;sister could not communicate 
> (15059) in 374675.moonwatch, job_start_error from node n13 in job_start_error
> 01/16/2007 08:56:09;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
> 01/16/2007 08:56:09;0001;   pbs_mom;Svr;pbs_mom;job_start_error, 
> job_start_error: sent 124 ABORT requests, should be 127
> 01/16/2007 08:56:09;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
> 01/16/2007 08:56:09;0001;   pbs_mom;Svr;pbs_mom;node_bailout, node_bailout: 
> received KILL/ABORT request for job 374675.moonwatch from node n13
> 01/16/2007 08:56:09;0001;   pbs_mom;Svr;pbs_mom;node_bailout, node_bailout: 
> received KILL/ABORT request for job 374675.moonwatch from node n13
> 01/16/2007 08:56:09;0001;   pbs_mom;Svr;pbs_mom;im_request, event 769744 
> taskid 0 not found
> 01/16/2007 08:56:09;0001;   pbs_mom;Svr;pbs_mom;im_request, job 
> 374675.moonwatch: command 99
> 01/16/2007 08:56:09;0002;   pbs_mom;Svr;im_eof;No error from addr 
> 192.168.1.13:15003
> 01/16/2007 08:56:09;0001;   pbs_mom;Req;obit reply;Job not found for obit reply
> 
> Node n13 says:
> 01/16/2007 08:56:09;0008;   pbs_mom;Job;374675.moonwatch;ERROR:    received 
> request 'ABORT_JOB' from 192.168.1.191:1023 for job '374675.moonwatch' (job 
> does not exist locally)
> 01/16/2007 08:56:09;0008;   pbs_mom;Job;374675.moonwatch;ERROR:    received 
> request 'ABORT_JOB' from 192.168.1.191:1023 for job '374675.moonwatch' (job 
> does not exist locally)
> 01/16/2007 08:56:09;0002;   pbs_mom;Svr;im_eof;End of File from addr 
> 192.168.1.191:1023
> 
> And now logs from an earlier start attempt.
> 
> Master node:
> 01/12/2007 23:01:51;0008;   pbs_mom;Job;374675.moonwatch;Job Modified at 
> request of PBS_Server at s1
> 01/12/2007 23:05:00;0002;   pbs_mom;Svr;im_eof;Premature end of message from 
> addr 192.168.1.86:15003
> 01/12/2007 23:05:00;0001;   pbs_mom;Svr;pbs_mom;sister could not communicate 
> (15059) in 374675.moonwatch, job_start_error from node n86 in job_start_error
> 01/12/2007 23:05:00;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
> 01/12/2007 23:05:00;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
> 01/12/2007 23:05:00;0001;   pbs_mom;Svr;pbs_mom;node_bailout, node_bailout: 
> received KILL/ABORT request for job 374675.moonwatch from node n86
> 01/12/2007 23:05:00;0001;   pbs_mom;Svr;pbs_mom;node_bailout, node_bailout: 
> received KILL/ABORT request for job 374675.moonwatch from node n86
> 01/12/2007 23:05:00;0001;   pbs_mom;Svr;pbs_mom;im_request, event 496710 
> taskid 0 not found
> 01/12/2007 23:05:00;0001;   pbs_mom;Svr;pbs_mom;im_request, job 
> 374675.moonwatch: command 99
> 01/12/2007 23:05:00;0002;   pbs_mom;Svr;im_eof;No error from addr 
> 192.168.1.86:15003
> 01/12/2007 23:05:00;0001;   pbs_mom;Req;obit reply;Job not found for obit reply
> 
> Node n86:
> 01/12/2007 23:05:00;0008;   pbs_mom;Job;374675.moonwatch;ERROR:    received 
> request 'ABORT_JOB' from 192.168.1.199:1023 for job '374675.moonwatch' (job 
> does not exist locally)
> 01/12/2007 23:05:00;0008;   pbs_mom;Job;374675.moonwatch;ERROR:    received 
> request 'ABORT_JOB' from 192.168.1.199:1023 for job '374675.moonwatch' (job 
> does not exist locally)
> 01/12/2007 23:05:00;0002;   pbs_mom;Svr;im_eof;End of File from addr 
> 192.168.1.199:1023
> 
> 
> So in neither case does the sister log "JOIN JOB" (job sisters here
> usually do log it).
> 
> Is this the behaviour you are talking about? 

Yes, exactly.
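
For anyone who wants to check their own mom logs for the same pattern, here
is a rough Python sketch. It is not part of Torque. It assumes the
semicolon-separated record layout seen in the excerpts above
(timestamp;code;daemon;class;id;message) and that a sister which joined the
job would log a message containing "JOIN"; the exact wording of that message
is an assumption. The function names are made up for illustration.

```python
import re

# Assumed record start: "MM/DD/YYYY HH:MM:SS;" as in the excerpts in this thread.
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")
# Wording taken from the master node log quoted in this thread.
ABORT_COUNT = re.compile(r"sent (\d+) ABORT requests, should be (\d+)")


def parse_mom_log(text):
    """Parse pbs_mom log text into dicts, rejoining lines wrapped by the mailer."""
    records = []
    for raw in text.splitlines():
        # Drop mail quote markers ("> ") and trailing whitespace.
        line = raw.lstrip("> ").rstrip()
        if not line:
            continue
        if RECORD_START.match(line):
            records.append(line)
        elif records:
            # Continuation of the previous, wrapped record.
            records[-1] += " " + line
    parsed = []
    for rec in records:
        parts = rec.split(";", 5)
        if len(parts) != 6:
            continue  # not in the assumed layout, skip it
        stamp, code, daemon, cls, ident, message = parts
        parsed.append({
            "time": stamp,
            "code": code,
            "daemon": daemon.strip(),
            "class": cls,
            "id": ident,
            "message": message,
        })
    return parsed


def mentions_join(records, job_id):
    """True if any record about job_id has "JOIN" in its message (assumed marker)."""
    for r in records:
        about_job = job_id in r["id"] or job_id in r["message"]
        if about_job and "JOIN" in r["message"].upper():
            return True
    return False


def abort_shortfall(records):
    """Return (sent, expected) from the first ABORT count message, or None."""
    for r in records:
        m = ABORT_COUNT.search(r["message"])
        if m:
            return int(m.group(1)), int(m.group(2))
    return None
```

Run parse_mom_log() on the sister's log for the time window of the failed
start, then mentions_join(records, "374675.moonwatch") shows whether that
sister ever recorded the join. On the master's log, abort_shortfall() pulls
out the sent and expected ABORT counts when that message is present.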

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ake at hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


