[torqueusers] Strange job start

Danny Sternkopf dsternkopf at hpce.nec.com
Thu Mar 30 07:26:26 MST 2006


Hi,

On our 200-node cluster we sometimes see multi-node jobs that need 
almost 10 minutes to start after the scheduler (Maui) has decided to 
run them.

The job is in state running, and the MOM on the first node has received 
the batch script and the job information. With strace we could see that 
this MOM is talking to the other MOMs of the job and to the PBS server, 
but the prologue and the batch script are not running yet.

After a couple of minutes the following messages appear in the MOM log 
on the first node:

03/30/2006 15:40:18;0008;   pbs_mom;Job;9065.cacau1.nec;Job Modified at 
request of PBS_Server at cacau1.nec
03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;im_eof, Premature end of 
message from addr 172.16.9.59:15003
03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;sister could not 
communicate (15059) in 9065.cacau1.nec, job_start_error from node 
noco059.nec
  in job_start_error
03/30/2006 15:47:31;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
sisters
03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;job_start_error, 
job_start_error: sent 19 ABORT requests, should be 32
03/30/2006 15:47:31;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
sisters
03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;im_eof, Premature end of 
message from addr 172.16.9.60:15003
03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;sister could not 
communicate (15059) in 9065.cacau1.nec, job_start_error from node 
noco060.nec
  in job_start_error
03/30/2006 15:47:31;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
sisters
03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;job_start_error, 
job_start_error: sent 17 ABORT requests, should be 32
03/30/2006 15:47:31;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
sisters
03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;node_bailout, 
node_bailout: received KILL/ABORT request for job 9065.cacau1.nec from 
node noco060.nec
03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;im_eof, Premature end of 
message from addr 172.16.9.63:15003
03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;sister could not 
communicate (15059) in 9065.cacau1.nec, job_start_error from node 
noco063.nec
  in job_start_error
03/30/2006 15:47:31;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
sisters
03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;job_start_error, 
job_start_error: sent 17 ABORT requests, should be 32
03/30/2006 15:47:31;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
sisters
03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;node_bailout, 
node_bailout: received KILL/ABORT request for job 9065.cacau1.nec from 
node noco063.nec
03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;im_eof, Premature end of 
message from addr 172.16.9.64:15003
03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;sister could not 
communicate (15059) in 9065.cacau1.nec, job_start_error from node 
noco064.nec
  in job_start_error
03/30/2006 15:47:31;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
sisters
03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;job_start_error, 
job_start_error: sent 17 ABORT requests, should be 32
03/30/2006 15:47:31;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
sisters
03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;node_bailout, 
node_bailout: received KILL/ABORT request for job 9065.cacau1.nec from 
node noco064.nec
03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;im_eof, Premature end of 
message from addr 172.16.9.65:15003
03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;sister could not 
communicate (15059) in 9065.cacau1.nec, job_start_error from node 
noco065.nec

Then the node list was changed to reverse order, and the job started 
properly after a few seconds.

Any ideas what is going on in that case? What can cause a 
job_start_error?

We are running Torque version 1.2.0p5.
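
For anyone triaging a similar MOM log, here is a rough sketch of a 
script that lists which sister nodes reported job_start_error and which 
ABORT counts were logged. It is only an illustration: the function names 
(join_records, summarize) are made up for this example, and the regular 
expressions are guessed from the wrapped log lines quoted above, so they 
may need adjusting for other Torque versions or unwrapped log files.

```python
import re
from collections import Counter

# A record starts with a timestamp such as "03/30/2006 15:47:31;".
# Assumption: every other non-empty line is a wrapped continuation.
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")

# Patterns guessed from the log excerpt in this post (not from Torque docs).
SISTER_ERR = re.compile(
    r"sister could not communicate \((\d+)\) in (\S+), "
    r"job_start_error from node\s+(\S+)"
)
ABORT_COUNT = re.compile(r"sent (\d+) ABORT requests, should be (\d+)")


def join_records(text):
    """Re-join mail-wrapped lines into one string per log record."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            # Continuation of the previous record.
            records[-1] += " " + line
    return records


def summarize(text):
    """Return (Counter of failing sister nodes, list of (sent, expected))."""
    failed = Counter()
    aborts = []
    for rec in join_records(text):
        m = SISTER_ERR.search(rec)
        if m:
            failed[m.group(3)] += 1
        m = ABORT_COUNT.search(rec)
        if m:
            aborts.append((int(m.group(1)), int(m.group(2))))
    return failed, aborts


SAMPLE = """\
03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;sister could not
communicate (15059) in 9065.cacau1.nec, job_start_error from node
noco059.nec
  in job_start_error
03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;job_start_error,
job_start_error: sent 19 ABORT requests, should be 32
"""

if __name__ == "__main__":
    failed_nodes, abort_counts = summarize(SAMPLE)
    for node, count in failed_nodes.items():
        print(node, count)
    for sent, expected in abort_counts:
        print("ABORT sent", sent, "of", expected)
```

Run over the full MOM log, this would show whether the same sister 
nodes are involved each time a job start is delayed.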

Thanks for your help and best regards,

Danny