[torquedev] [Bug 96] handle failed stagein jobs properly
bugzilla-daemon at supercluster.org
Fri Nov 5 18:38:34 MDT 2010
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=96
--- Comment #3 from Simon Toth <SimonT at mail.muni.cz> 2010-11-05 18:38:33 MDT ---
(In reply to comment #2)
> Why won't the job enter the "H" held/hold state if it's working as
> designed, instead of staying in the "W"aiting status? Seems to me the node
> assignment should be released if this happens, and right now it is not
> (probably because it stays in Waiting status?).
As far as I know, held status is only for queued jobs. It is used to determine
if a job is ready to be run. Waiting is a different state.
Looking at the code, the job should be released back into the queued state once
the timeout expires.
> Secondly, why hold it 30 minutes? A lot can happen in 30 minutes.
I don't know, it's just a preset (I consider it quite reasonable). It's a
compile-time constant, so you can easily change it if you don't like it.
> It would be nice to make that policy configurable. Like I said, to either:
> * delete/discard
> * hold (really hold, not stay in waiting)
> * hold for a configurable amount of time
>
> Perhaps something like:
>
> set server stagefail_policy = <hold|discard>
> set server stagefail_holdtimeout = 90
Yeah, configurable is always better.
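
Just to illustrate the idea, here is a rough sketch (in Python, not the actual
server code) of how such a policy could be applied when stage-in fails. The
names stagefail_policy and stagefail_holdtimeout are only the ones proposed in
comment #2, not existing server attributes. The Job class, the "discarded"
marker and the timeout unit (seconds here) are assumptions made up for the
example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class StagefailPolicy(Enum):
    """Hypothetical values for the proposed stagefail_policy attribute."""
    HOLD = "hold"
    DISCARD = "discard"


@dataclass
class Job:
    """Minimal stand-in for a job record; not the real server structure."""
    job_id: str
    state: str = "Q"                    # single-letter state: Q, H, W, ...
    release_at: Optional[float] = None  # when a held job may be requeued


def handle_stagein_failure(job: Job, policy: StagefailPolicy,
                           hold_timeout: float, now: float) -> Job:
    """Apply the configured policy to a job whose stage-in just failed.

    hold_timeout is assumed to be in seconds; the proposal does not say.
    """
    if policy is StagefailPolicy.DISCARD:
        job.state = "discarded"         # hypothetical marker for a deleted job
        job.release_at = None
    else:
        job.state = "H"                 # really hold, instead of staying in W
        job.release_at = now + hold_timeout
    return job


def release_if_expired(job: Job, now: float) -> Job:
    """Move a held job back to queued once its hold timeout has passed."""
    if job.state == "H" and job.release_at is not None and now >= job.release_at:
        job.state = "Q"
        job.release_at = None
    return job
```

With the hold policy and a timeout of 90, a job that fails stage-in at time
1000 would go to "H" and return to "Q" on the first check at or after 1090.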