[torqueusers] Strange job start

Danny Sternkopf dsternkopf at hpce.nec.com
Fri Mar 31 02:09:22 MST 2006


Hi,

Which version is required then? 1.2.0p6, or already 2?
Is version 2 compatible with version 1?
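
In case it helps anyone reading similar logs, here is a minimal Python
sketch for summarizing which sister nodes reported a job_start_error and
how many ABORT requests went out. It assumes the MOM log wording matches
the excerpt quoted further down. The function name summarize() and the
regular expressions are only an illustration and not part of TORQUE, and
wrapped lines are simply joined with spaces.

```python
import re
from collections import Counter

# Sample text copied from the MOM log excerpt in this thread
# (mail quoting removed, line wrapping kept).
SAMPLE = """\
03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;im_eof, Premature end of
message from addr 172.16.9.59:15003
03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;sister could not
communicate (15059) in 9065.cacau1.nec, job_start_error from node
noco059.nec
03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;job_start_error,
job_start_error: sent 19 ABORT requests, should be 32
"""

# Patterns are assumptions based only on the wording seen in the excerpt.
FAILED_NODE = re.compile(
    r"sister could not communicate \((\d+)\) in ([^,\s]+), "
    r"job_start_error from node (\S+)"
)
ABORT_COUNT = re.compile(r"sent (\d+) ABORT requests, should be (\d+)")


def summarize(text):
    """Return (failures per (job, node), list of (sent, expected) ABORT counts)."""
    # Collapse all whitespace so entries wrapped over several lines still match.
    flat = " ".join(text.split())
    failed = Counter()
    for _code, job, node in FAILED_NODE.findall(flat):
        failed[(job, node)] += 1
    aborts = [(int(sent), int(expected))
              for sent, expected in ABORT_COUNT.findall(flat)]
    return failed, aborts


if __name__ == "__main__":
    failed, aborts = summarize(SAMPLE)
    for (job, node), count in sorted(failed.items()):
        print(job, node, count)
    for sent, expected in aborts:
        print("ABORT requests sent:", sent, "expected:", expected)
```

Fed the text of a whole MOM log file, summarize() should give one counter
entry per (job, node) pair. A hostname that was split in two by line
wrapping would be captured only up to the break.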

Thanks and Best regards,

Danny

Garrick Staples wrote:
> On Thu, Mar 30, 2006 at 04:26:26PM +0200, Danny Sternkopf alleged:
>> Hi,
>>
>> on our 200-node cluster we sometimes see multi-node jobs which need 
>> almost 10 minutes to start after the scheduler (Maui) has decided to 
>> run them.
>>
>> The job is in state running, and the MOM on the first node has received 
>> the batch script and the job information. With strace we could see that 
>> this MOM is talking to the other MOMs of the job and to the PBS server, 
>> but the prologue and the batch script are not running yet.
>>
>> After a couple of minutes the following messages appear in the MOM log 
>> on the first node:
>>
>> 03/30/2006 15:40:18;0008;   pbs_mom;Job;9065.cacau1.nec;Job Modified at 
>> request of PBS_Server at cacau1.nec
>> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;im_eof, Premature end of 
>> message from addr 172.16.9.59:15003
>> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;sister could not 
>> communicate (15059) in 9065.cacau1.nec, job_start_error from node 
>> noco059.nec in job_start_error
>> 03/30/2006 15:47:31;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
>> sisters
>> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;job_start_error, 
>> job_start_error: sent 19 ABORT requests, should be 32
>> 03/30/2006 15:47:31;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
>> sisters
>> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;im_eof, Premature end of 
>> message from addr 172.16.9.60:15003
>> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;sister could not 
>> communicate (15059) in 9065.cacau1.nec, job_start_error from node 
>> noco060.nec in job_start_error
>> 03/30/2006 15:47:31;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
>> sisters
>> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;job_start_error, 
>> job_start_error: sent 17 ABORT requests, should be 32
>> 03/30/2006 15:47:31;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
>> sisters
>> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;node_bailout, 
>> node_bailout: received KILL/ABORT request for job 9065.cacau1.nec from 
>> node noco060.nec
>> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;im_eof, Premature end of 
>> message from addr 172.16.9.63:15003
>> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;sister could not 
>> communicate (15059) in 9065.cacau1.nec, job_start_error from node 
>> noco063.nec in job_start_error
>> 03/30/2006 15:47:31;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
>> sisters
>> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;job_start_error, 
>> job_start_error: sent 17 ABORT requests, should be 32
>> 03/30/2006 15:47:31;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
>> sisters
>> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;node_bailout, 
>> node_bailout: received KILL/ABORT request for job 9065.cacau1.nec from 
>> node noco063.nec
>> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;im_eof, Premature end of 
>> message from addr 172.16.9.64:15003
>> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;sister could not 
>> communicate (15059) in 9065.cacau1.nec, job_start_error from node 
>> noco064.nec in job_start_error
>> 03/30/2006 15:47:31;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
>> sisters
>> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;job_start_error, 
>> job_start_error: sent 17 ABORT requests, should be 32
>> 03/30/2006 15:47:31;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
>> sisters
>> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;node_bailout, 
>> node_bailout: received KILL/ABORT request for job 9065.cacau1.nec from 
>> node noco064.nec
>> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;im_eof, Premature end of 
>> message from addr 172.16.9.65:15003
>> 03/30/2006 15:47:31;0001;   pbs_mom;Svr;pbs_mom;sister could not 
>> communicate (15059) in 9065.cacau1.nec, job_start_error from node 
>> noco065.nec
>>
>> Then the node list changed to reverse order and the job started properly 
>> after a few seconds.
>>
>> Any ideas what's going on in that case? What can cause a 
>> job_start_error?
>>
>> We are running Torque version 1.2.0p5.
> 
> I don't remember how long ago this was fixed, but modern TORQUE will
> recover after the first error.

-- 
Danny Sternkopf                         dsternkopf at hpce.nec.com
High Performance Computing Europe GmbH,Service & Delivery Group
Stuttgart, Germany phone: +49-711-68770-35 fax: +49-711-6877145

