[torqueusers] Problem with torque's connection timeout handling.

Ake Ake.Sandgren at hpc2n.umu.se
Thu Nov 4 10:14:45 MST 2004


On Thu, Nov 04, 2004 at 10:31:39AM -0700, Wightman wrote:
> We have been working with a few other sites regarding this problem.  We
> are looking for the cleanest solution.  
> 
> Would it be useful to have a parameter called MINRESTARTTIME which would
> be the time Moab would wait before it tries to start the job again? 
> This could give the resource manager time to start the job up
> successfully and eventually report it back to Maui.
> 
> If this sounds helpful, please let us know and we will implement that
> parameter in Maui/Moab.

I don't think that will help, at least not while using pbs_runjob.
(I haven't checked what a change to pbs_asyrunjob would do.)

If PBSD_Commit doesn't wait for the mom reply, the pbs_server will think
that the job is queued, not running. When the Obit then comes in before
the MINRESTARTTIME has passed, the server will think that the Obit is in
error, and things will get weird after that.

If I force PBSD_Commit to wait until the mom replies, I get the following
chain of messages.
(This was done on a 2-node system: one pbs_server, one pbs_mom, one job.)

Maui		pbs_server	pbs_serv_fork	pbs_mom
runjob ->	prepare
		fork ->		prepare
		return to
		main loop
				PBSD_Commit ->	start job
						...
						...
rdrpy timeout
				rdrpy	    <-	send rply
				exit(0)
		sigchld
		wake_task
(no receive) <-	send_reply
...
Start new iteration
ClusterQuery
statnode ->	prepare

rdrpy (which now gets the runjob reply from above)
(which is a CHOICE_NULL)
so ClusterQuery fails
WorkloadQuery
Sees that it needs a PBSInitialize
PBSInitialize
pbs_disconnect -> 
gets (or not?) data from statnode reply
Here maui stalls forever
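
The desync in the trace above can be reproduced with a toy model. This is
only a sketch in Python, not the real C code: the Connection class, the
request tags and the replay() helper are all made up for illustration, and
I am assuming (not claiming) that replies could carry a tag naming the
request they answer. It shows a reader that takes "whatever comes next"
picking up the late runjob reply as the answer to statnode, while a reader
that matches tags discards the stale reply.

```python
from collections import deque


class Connection:
    """Toy stand-in for one client/server socket; replies arrive in FIFO order."""

    def __init__(self):
        self.wire = deque()  # replies sent by the server, not yet read

    def server_send(self, request, body):
        # Hypothetical: every reply is tagged with the request it answers.
        self.wire.append((request, body))

    def read_any(self):
        """Naive read: take the next reply, or None if nothing arrived (timeout)."""
        return self.wire.popleft() if self.wire else None

    def read_for(self, request):
        """Tag-matching read: drop stale replies until one answers `request`."""
        while self.wire:
            tag, body = self.wire.popleft()
            if tag == request:
                return (tag, body)
        return None  # timeout


def replay(match_tags):
    """Replay the trace: runjob times out, its reply shows up late, then statnode."""
    conn = Connection()

    def read(request):
        return conn.read_for(request) if match_tags else conn.read_any()

    results = {}
    results["runjob"] = read("runjob")           # nothing on the wire yet: timeout
    conn.server_send("runjob", "CHOICE_NULL")    # late reply, after the client gave up
    conn.server_send("statnode", "node status")  # reply to the next request
    results["statnode"] = read("statnode")
    results["left_on_wire"] = len(conn.wire)
    return results


naive = replay(match_tags=False)
matched = replay(match_tags=True)
```

In the naive run the statnode call gets the runjob reply and the real
statnode reply is left sitting on the connection, which matches what the
trace shows just before the stall. With tag matching the stale reply is
thrown away and nothing is left over.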


To me this looks like, "scrap all comm code and start from scratch".
The communication state machine is simply wrong (or doesn't exist in the
first place).

The assumption that a timeout is a failure is not a good one.
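
As a rough illustration of what I mean, here is a minimal sketch (Python,
with invented names; none of this is existing TORQUE or Maui code) where a
timeout is a third outcome, "unknown", instead of being folded into
failure. A job whose start request timed out goes into a pending state: it
is not resubmitted, and an Obit for it is accepted rather than treated as
an error.

```python
from enum import Enum


class Outcome(Enum):
    STARTED = "started"    # explicit positive reply
    REJECTED = "rejected"  # explicit negative reply
    UNKNOWN = "unknown"    # timed out: the request may still have succeeded


class JobState(Enum):
    QUEUED = "queued"
    START_PENDING = "start_pending"  # start was requested, result not known yet
    RUNNING = "running"


def after_runjob(outcome):
    """Next job state after a run request, keeping timeout apart from failure."""
    if outcome is Outcome.STARTED:
        return JobState.RUNNING
    if outcome is Outcome.REJECTED:
        return JobState.QUEUED
    return JobState.START_PENDING


def may_resubmit(state):
    """Only a job known to be queued may be started again."""
    return state is JobState.QUEUED


def obit_is_valid(state):
    """An Obit makes sense for any job that may have been started."""
    return state in (JobState.START_PENDING, JobState.RUNNING)


state = after_runjob(Outcome.UNKNOWN)
```

Here state ends up as START_PENDING, so may_resubmit(state) is False and
obit_is_valid(state) is True. How the pending state gets resolved (a later
reply, a status query, an Obit) is left out of the sketch on purpose.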

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ake at hpc2n.umu.se	Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se

