[torquedev] [Bug 96] handle failed stagein jobs properly

bugzilla-daemon at supercluster.org
Fri Nov 5 18:38:34 MDT 2010


http://www.clusterresources.com/bugzilla/show_bug.cgi?id=96

--- Comment #3 from Simon Toth <SimonT at mail.muni.cz> 2010-11-05 18:38:33 MDT ---
(In reply to comment #2)
> Why won't the job enter the "H" (held) state if it's working as
> designed, instead of staying in the "W" (waiting) state? Seems to me the node
> assignment should be released if this happens, and right now it is not
> (probably because the job stays in the waiting state?).

As far as I know, the held state applies only to queued jobs; it is used to
determine whether a job is ready to run. Waiting is a different state.

Looking at the code, the job should be released back into the queued state
once the timeout expires.
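
As a rough illustration of that behaviour (not the actual TORQUE code), the
release step could be sketched like this. The constant name, the job states
and the struct below are all hypothetical and only stand in for whatever the
server really uses; the 30-minute value is the preset discussed in this
thread.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical name for the compile-time preset: 30 minutes, in seconds. */
#define STAGEFAIL_WAIT_SECS (30 * 60)

/* Hypothetical, simplified set of job states. */
enum job_state { JOB_QUEUED, JOB_HELD, JOB_WAITING, JOB_RUNNING };

/* Hypothetical, minimal job record. */
struct job {
    enum job_state state;
    time_t stagein_failed_at; /* when the stagein failure was recorded */
};

/*
 * Move a waiting job back to queued once the timeout has passed.
 * Returns true only if the state was actually changed.
 */
bool release_if_timed_out(struct job *j, time_t now)
{
    if (j->state != JOB_WAITING)
        return false;
    if (difftime(now, j->stagein_failed_at) < STAGEFAIL_WAIT_SECS)
        return false;
    j->state = JOB_QUEUED;
    return true;
}
```

With a check like this, a job whose stagein failed stays in waiting until the
preset has elapsed and is then picked up again as an ordinary queued job.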

> Secondly, why hold it 30 minutes? A lot can happen in 30 minutes.

I don't know, it's just a preset (I consider it quite reasonable). It's a
compile-time constant, so you can easily change it if you don't like it.

> It would be nice to make that policy configurable. Like I said to either;
>  * delete/discard
>  * hold (really hold, not stay in waiting)
>  * hold for a configurable amount of time
> 
> Perhaps something like:
> 
> set server stagefail_policy = <hold|discard>
> set server stagefail_holdtimeout = 90

Yeah, configurable is always better.
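
A minimal sketch of how the server could act on such settings, assuming the
two attributes proposed above. Neither stagefail_policy nor
stagefail_holdtimeout exists in TORQUE today, and every type and function
name below is made up for illustration. The proposal does not say what unit
the timeout uses, so the sketch only distinguishes "set" from "not set".

```c
#include <string.h>

/* Values of the proposed (not yet existing) stagefail_policy attribute. */
enum stagefail_policy { STAGEFAIL_HOLD, STAGEFAIL_DISCARD, STAGEFAIL_INVALID };

/* What the server would do with a job whose stagein failed. */
enum stagefail_action {
    ACTION_DELETE_JOB,        /* discard the job */
    ACTION_HOLD_JOB,          /* really hold it until released by hand */
    ACTION_HOLD_WITH_TIMEOUT  /* hold it, then release after the timeout */
};

/* Hypothetical container for the two proposed server attributes. */
struct stagefail_config {
    enum stagefail_policy policy;
    long hold_timeout; /* proposed stagefail_holdtimeout; 0 means not set */
};

/* Map the attribute string to a policy; unknown strings are rejected. */
enum stagefail_policy parse_stagefail_policy(const char *value)
{
    if (strcmp(value, "hold") == 0)
        return STAGEFAIL_HOLD;
    if (strcmp(value, "discard") == 0)
        return STAGEFAIL_DISCARD;
    return STAGEFAIL_INVALID;
}

/* Decide what to do with a job after a stagein failure. */
enum stagefail_action on_stagein_failure(const struct stagefail_config *cfg)
{
    if (cfg->policy == STAGEFAIL_DISCARD)
        return ACTION_DELETE_JOB;
    if (cfg->hold_timeout > 0)
        return ACTION_HOLD_WITH_TIMEOUT;
    return ACTION_HOLD_JOB;
}
```

With the example values from the proposal (policy "hold", timeout 90), this
would select the hold-with-timeout action; with "discard" the job would be
deleted regardless of the timeout.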
