[torqueusers] Strange job start
Garrick Staples
garrick at usc.edu
Thu Mar 30 10:03:34 MST 2006
On Thu, Mar 30, 2006 at 04:26:26PM +0200, Danny Sternkopf alleged:
> Hi,
>
> on our 200-node cluster we sometimes see multi-node jobs which need
> almost 10 minutes to start after the scheduler (Maui) has decided to run them.
>
> The job is in state running and the MOM on the first node has received the
> batch script and the job info. With strace we could see that the MOM is
> talking to the other MOMs of this job and to the PBS server, but the prologue
> and the batch script are not running yet.
>
> After a couple of minutes the following messages appear in the MOM log
> on the first node:
>
> 03/30/2006 15:40:18;0008; pbs_mom;Job;9065.cacau1.nec;Job Modified at
> request of PBS_Server at cacau1.nec
> 03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;im_eof, Premature end of
> message from addr 172.16.9.59:15003
> 03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;sister could not
> communicate (15059) in 9065.cacau1.nec, job_start_error from node
> noco059.nec
> in job_start_error
> 03/30/2006 15:47:31;0008; pbs_mom;Req;send_sisters;sending ABORT to
> sisters
> 03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;job_start_error,
> job_start_error: sent 19 ABORT requests, should be 32
> 03/30/2006 15:47:31;0008; pbs_mom;Req;send_sisters;sending ABORT to
> sisters
> 03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;im_eof, Premature end of
> message from addr 172.16.9.60:15003
> 03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;sister could not
> communicate (15059) in 9065.cacau1.nec, job_start_error from node
> noco060.nec
> in job_start_error
> 03/30/2006 15:47:31;0008; pbs_mom;Req;send_sisters;sending ABORT to
> sisters
> 03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;job_start_error,
> job_start_error: sent 17 ABORT requests, should be 32
> 03/30/2006 15:47:31;0008; pbs_mom;Req;send_sisters;sending ABORT to
> sisters
> 03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;node_bailout,
> node_bailout: received KILL/ABORT request for job 9065.cacau1.nec from
> node noco060.nec
> 03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;im_eof, Premature end of
> message from addr 172.16.9.63:15003
> 03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;sister could not
> communicate (15059) in 9065.cacau1.nec, job_start_error from node
> noco063.nec
> in job_start_error
> 03/30/2006 15:47:31;0008; pbs_mom;Req;send_sisters;sending ABORT to
> sisters
> 03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;job_start_error,
> job_start_error: sent 17 ABORT requests, should be 32
> 03/30/2006 15:47:31;0008; pbs_mom;Req;send_sisters;sending ABORT to
> sisters
> 03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;node_bailout,
> node_bailout: received KILL/ABORT request for job 9065.cacau1.nec from
> node noco063.nec
> 03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;im_eof, Premature end of
> message from addr 172.16.9.64:15003
> 03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;sister could not
> communicate (15059) in 9065.cacau1.nec, job_start_error from node
> noco064.nec
> in job_start_error
> 03/30/2006 15:47:31;0008; pbs_mom;Req;send_sisters;sending ABORT to
> sisters
> 03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;job_start_error,
> job_start_error: sent 17 ABORT requests, should be 32
> 03/30/2006 15:47:31;0008; pbs_mom;Req;send_sisters;sending ABORT to
> sisters
> 03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;node_bailout,
> node_bailout: received KILL/ABORT request for job 9065.cacau1.nec from
> node noco064.nec
> 03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;im_eof, Premature end of
> message from addr 172.16.9.65:15003
> 03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;sister could not
> communicate (15059) in 9065.cacau1.nec, job_start_error from node
> noco065.nec
>
> Then the node list changed to reverse order and the job started properly
> after a few seconds.
>
> Any ideas what's going on in that case? What can cause a
> job_start_error?
>
> We are running Torque version 1.2.0p5.
I don't remember how long ago this was fixed, but modern TORQUE will
recover after the first error.
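
For anyone triaging a MOM log like the one quoted above, here is a minimal Python sketch (not part of TORQUE) that measures how long the job sat before the first im_eof error and lists the sisters named in job_start_error records. It assumes the mail-wrapped log lines have been rejoined into one record per line, and that each record has the six semicolon-separated fields seen in the excerpt. The function names and the dictionary keys are made up for illustration.

```python
import re
from datetime import datetime

# Matches the node name in "job_start_error from node <name>" as seen in the excerpt.
SISTER_RE = re.compile(r"job_start_error from node (\S+)")


def parse_mom_line(line):
    """Split one unwrapped pbs_mom record into fields; return None if it does not fit."""
    parts = [p.strip() for p in line.split(";", 5)]
    if len(parts) != 6:
        return None
    stamp, code, daemon, kind, obj, message = parts
    try:
        when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return {"time": when, "code": code, "daemon": daemon,
            "class": kind, "object": obj, "message": message}


def start_delay_seconds(lines):
    """Seconds between the first record and the first im_eof record, or None."""
    records = [r for r in map(parse_mom_line, lines) if r is not None]
    if not records:
        return None
    for rec in records:
        if rec["message"].startswith("im_eof"):
            return (rec["time"] - records[0]["time"]).total_seconds()
    return None


def failed_sisters(lines):
    """Node names reported in job_start_error records, in order, without repeats."""
    seen = []
    for rec in map(parse_mom_line, lines):
        if rec is None:
            continue
        match = SISTER_RE.search(rec["message"])
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


if __name__ == "__main__":
    sample = [
        "03/30/2006 15:40:18;0008; pbs_mom;Job;9065.cacau1.nec;"
        "Job Modified at request of PBS_Server at cacau1.nec",
        "03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;"
        "im_eof, Premature end of message from addr 172.16.9.59:15003",
        "03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;"
        "sister could not communicate (15059) in 9065.cacau1.nec, "
        "job_start_error from node noco059.nec in job_start_error",
    ]
    print(start_delay_seconds(sample))  # 433.0
    print(failed_sisters(sample))       # ['noco059.nec']
```

On the first three records of the excerpt this reports a gap of 433 seconds (15:40:18 to 15:47:31), which matches the several-minute stall described in the question.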
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California