[torqueusers] Strange job start

Garrick Staples garrick at usc.edu
Thu Mar 30 10:03:34 MST 2006


On Thu, Mar 30, 2006 at 04:26:26PM +0200, Danny Sternkopf alleged:
> Hi,
> 
> on our 200-node cluster we sometimes see multi-node jobs which need
> almost 10 minutes to start after the scheduler (Maui) has decided to run them.
> 
> The job is in state running, and the MOM on the first node has received the
> batch script and the job information. With strace we could see that this MOM
> is talking to the other MOMs of the job and to the PBS server, but the
> prologue and the batch script are not running yet.
> 
> After a couple of minutes the following messages appear in the MOM log 
> on the first node:
> 
> 03/30/2006 15:40:18;0008;   pbs_mom;Job;9065.cacau1.nec;Job Modified at 
> request of PBS_Server at cacau1.nec
> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;im_eof, Premature end of 
> message from addr 172.16.9.59:15003
> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;sister could not 
> communicate (15059) in 9065.cacau1.nec, job_start_error from node 
> noco059.nec
>  in job_start_error
> 03/30/2006 15:47:31;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
> sisters
> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;job_start_error, 
> job_start_error: sent 19 ABORT requests, should be 32
> 03/30/2006 15:47:31;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
> sisters
> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;im_eof, Premature end of 
> message from addr 172.16.9.60:15003
> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;sister could not 
> communicate (15059) in 9065.cacau1.nec, job_start_error from node 
> noco060.nec
>  in job_start_error
> 03/30/2006 15:47:31;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
> sisters
> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;job_start_error, 
> job_start_error: sent 17 ABORT requests, should be 32
> 03/30/2006 15:47:31;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
> sisters
> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;node_bailout, 
> node_bailout: received KILL/ABORT request for job 9065.cacau1.nec from 
> node noco060.nec
> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;im_eof, Premature end of 
> message from addr 172.16.9.63:15003
> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;sister could not 
> communicate (15059) in 9065.cacau1.nec, job_start_error from node 
> noco063.nec
>  in job_start_error
> 03/30/2006 15:47:31;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
> sisters
> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;job_start_error, 
> job_start_error: sent 17 ABORT requests, should be 32
> 03/30/2006 15:47:31;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
> sisters
> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;node_bailout, 
> node_bailout: received KILL/ABORT request for job 9065.cacau1.nec from 
> node noco063.nec
> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;im_eof, Premature end of 
> message from addr 172.16.9.64:15003
> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;sister could not 
> communicate (15059) in 9065.cacau1.nec, job_start_error from node 
> noco064.nec
>  in job_start_error
> 03/30/2006 15:47:31;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
> sisters
> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;job_start_error, 
> job_start_error: sent 17 ABORT requests, should be 32
> 03/30/2006 15:47:31;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
> sisters
> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;node_bailout, 
> node_bailout: received KILL/ABORT request for job 9065.cacau1.nec from 
> node noco064.nec
> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;im_eof, Premature end of 
> message from addr 172.16.9.65:15003
> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;sister could not 
> communicate (15059) in 9065.cacau1.nec, job_start_error from node 
> noco065.nec
> 
> Then the node list changes to reverse order and the job starts properly
> after a few seconds.
> 
> Any ideas what's going on in that case? What can cause a
> job_start_error?
> 
> We are running Torque version 1.2.0p5.

I don't remember how long ago this was fixed, but modern TORQUE will
recover after the first error.
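
For anyone chasing the same symptom, here is a rough Python sketch for pulling the failing sisters and the length of the stall out of a MOM log. It assumes the log has one record per line (unlike the mail-wrapped excerpt quoted above) and that fields are separated by ';' as in those lines. The function names and the SAMPLE list are made up for illustration; none of this is part of TORQUE.

```python
import re
from datetime import datetime

# Matches the im_eof message seen in the quoted MOM log.
ADDR_RE = re.compile(
    r"Premature end of message from addr (\d+\.\d+\.\d+\.\d+):(\d+)"
)


def parse_mom_line(line):
    """Split one unwrapped log line into (timestamp, code, daemon, class, object, message).

    Returns None for lines that do not have six ';'-separated fields.
    """
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    try:
        stamp = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return (stamp, parts[1], parts[2].strip(), parts[3], parts[4], parts[5])


def failed_sisters(lines):
    """Return the sister addresses that produced an im_eof error, in first-seen order."""
    seen = []
    for line in lines:
        rec = parse_mom_line(line)
        if rec is None:
            continue
        match = ADDR_RE.search(rec[5])
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def stall_seconds(lines):
    """Seconds between the first parsable record and the first im_eof error, or None."""
    first = None
    for line in lines:
        rec = parse_mom_line(line)
        if rec is None:
            continue
        if first is None:
            first = rec[0]
        if ADDR_RE.search(rec[5]):
            return int((rec[0] - first).total_seconds())
    return None


# Three records from the quoted log, joined back onto single lines.
SAMPLE = [
    "03/30/2006 15:40:18;0008;   pbs_mom;Job;9065.cacau1.nec;"
    "Job Modified at request of PBS_Server at cacau1.nec",
    "03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;"
    "im_eof, Premature end of message from addr 172.16.9.59:15003",
    "03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;"
    "im_eof, Premature end of message from addr 172.16.9.60:15003",
]

if __name__ == "__main__":
    print(failed_sisters(SAMPLE))
    print(stall_seconds(SAMPLE))
```

On the sample records this reports the two sister addresses and a gap of a little over seven minutes between the "Job Modified" record and the first im_eof, which matches the delay described in the original report.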

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California