[torqueusers] Strange job start
Danny Sternkopf
dsternkopf at hpce.nec.com
Thu Mar 30 07:26:26 MST 2006
Hi,
on our 200-node cluster we sometimes see multi-node jobs that need
almost 10 minutes to start after the scheduler (Maui) has decided to
start them. The job is in state running, and the MOM on the first node
has received the batch script and the job information. With strace we
can see that this MOM is talking to the other MOMs of the job and to the
PBS server, but the prologue and the batch script are not running yet.
After a couple of minutes the following messages appear in the MOM log
on the first node:
03/30/2006 15:40:18;0008; pbs_mom;Job;9065.cacau1.nec;Job Modified at
request of PBS_Server at cacau1.nec
03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;im_eof, Premature end of
message from addr 172.16.9.59:15003
03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;sister could not
communicate (15059) in 9065.cacau1.nec, job_start_error from node
noco059.nec
in job_start_error
03/30/2006 15:47:31;0008; pbs_mom;Req;send_sisters;sending ABORT to
sisters
03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;job_start_error,
job_start_error: sent 19 ABORT requests, should be 32
03/30/2006 15:47:31;0008; pbs_mom;Req;send_sisters;sending ABORT to
sisters
03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;im_eof, Premature end of
message from addr 172.16.9.60:15003
03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;sister could not
communicate (15059) in 9065.cacau1.nec, job_start_error from node
noco060.nec
in job_start_error
03/30/2006 15:47:31;0008; pbs_mom;Req;send_sisters;sending ABORT to
sisters
03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;job_start_error,
job_start_error: sent 17 ABORT requests, should be 32
03/30/2006 15:47:31;0008; pbs_mom;Req;send_sisters;sending ABORT to
sisters
03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;node_bailout,
node_bailout: received KILL/ABORT request for job 9065.cacau1.nec from
node noco060.nec
03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;im_eof, Premature end of
message from addr 172.16.9.63:15003
03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;sister could not
communicate (15059) in 9065.cacau1.nec, job_start_error from node
noco063.nec
in job_start_error
03/30/2006 15:47:31;0008; pbs_mom;Req;send_sisters;sending ABORT to
sisters
03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;job_start_error,
job_start_error: sent 17 ABORT requests, should be 32
03/30/2006 15:47:31;0008; pbs_mom;Req;send_sisters;sending ABORT to
sisters
03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;node_bailout,
node_bailout: received KILL/ABORT request for job 9065.cacau1.nec from
node noco063.nec
03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;im_eof, Premature end of
message from addr 172.16.9.64:15003
03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;sister could not
communicate (15059) in 9065.cacau1.nec, job_start_error from node
noco064.nec
in job_start_error
03/30/2006 15:47:31;0008; pbs_mom;Req;send_sisters;sending ABORT to
sisters
03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;job_start_error,
job_start_error: sent 17 ABORT requests, should be 32
03/30/2006 15:47:31;0008; pbs_mom;Req;send_sisters;sending ABORT to
sisters
03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;node_bailout,
node_bailout: received KILL/ABORT request for job 9065.cacau1.nec from
node noco064.nec
03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;im_eof, Premature end of
message from addr 172.16.9.65:15003
03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;sister could not
communicate (15059) in 9065.cacau1.nec, job_start_error from node
noco065.nec
Then the node list changes to reverse order and the job starts properly
after a few seconds.
Any ideas what is going on in that case? What can cause a
job_start_error?
We are running Torque version 1.2.0p5.
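To make excerpts like the one above easier to compare between incidents, here is a minimal Python sketch that rejoins the wrapped log lines, lists the sister nodes named in "job_start_error from node" messages, and measures the time between the "Job Modified" record and the first such error. It assumes only the record layout visible in the excerpt (a "MM/DD/YYYY HH:MM:SS;" prefix on each record); the names join_wrapped, summarize and SAMPLE are made up for this illustration and are not part of Torque.

```python
import re
from datetime import datetime

# A record is assumed to start with "MM/DD/YYYY HH:MM:SS;" as in the excerpt.
# Any other line is treated as a continuation created by e-mail line wrapping.
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")
FAILED_NODE = re.compile(r"job_start_error from node\s+(\S+)")


def join_wrapped(lines):
    """Rejoin wrapped lines into one string per log record."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def parse_time(record):
    """Parse the 19-character timestamp at the start of a record."""
    return datetime.strptime(record[:19], "%m/%d/%Y %H:%M:%S")


def summarize(lines):
    """Return (failed sister nodes in order of appearance, stall in seconds)."""
    records = [r for r in join_wrapped(lines) if RECORD_START.match(r)]
    failed = []
    for record in records:
        match = FAILED_NODE.search(record)
        if match and match.group(1) not in failed:
            failed.append(match.group(1))
    modified = [parse_time(r) for r in records if "Job Modified" in r]
    errors = [parse_time(r) for r in records if FAILED_NODE.search(r)]
    stall = None
    if modified and errors:
        # Time from the first "Job Modified" record to the first sister error.
        stall = (errors[0] - modified[0]).total_seconds()
    return failed, stall


# First records of the excerpt above, copied with their original wrapping.
SAMPLE = """\
03/30/2006 15:40:18;0008; pbs_mom;Job;9065.cacau1.nec;Job Modified at
request of PBS_Server at cacau1.nec
03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;im_eof, Premature end of
message from addr 172.16.9.59:15003
03/30/2006 15:47:31;0001; pbs_mom;Svr;pbs_mom;sister could not
communicate (15059) in 9065.cacau1.nec, job_start_error from node
noco059.nec
in job_start_error
"""

if __name__ == "__main__":
    nodes, stall_seconds = summarize(SAMPLE.splitlines())
    print("failed sisters:", nodes)
    print("stall in seconds:", stall_seconds)
```

For the records shown, the gap between 15:40:18 and 15:47:31 is 433 seconds, and the first sister reported is noco059.nec. Running the same summary over the complete MOM log might show whether the same sisters fail each time.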
Thanks for your help and Best regards,
Danny