[torquedev] [Bug 96] handle failed stagein jobs properly

bugzilla-daemon at supercluster.org
Fri Nov 5 08:43:09 MDT 2010


--- Comment #2 from ramon.bastiaans at sara.nl 2010-11-05 08:43:09 MDT ---
Well, for one, we have mailing by Torque disabled, using:
qmgr -c 'set server mail_domain = never'

So that email will never get sent to the end user. Furthermore, our end users
cannot log in to our system; we have no interactive (shell) access machine. So
they would never be able to retrieve that email anyway.

Why won't the job enter the "H" (held) state if it's working as designed,
instead of staying in the "W" (waiting) state? Seems to me the node assignment
should be released when this happens, and currently it is not (probably
because the job stays in the waiting state?).

Secondly, why hold it 30 minutes? A lot can happen in 30 minutes.

It would be nice to make that policy configurable. Like I said, to either:
 * delete/discard
 * hold (really hold, not stay in waiting)
 * hold for a configurable amount of time

Perhaps something like:

set server stagefail_policy = <hold|discard>
set server stagefail_holdtimeout = 90
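
To make the idea a bit more concrete, here is a rough sketch of how those two
settings could be interpreted. This is not existing Torque code: the
attributes stagefail_policy and stagefail_holdtimeout are only the proposal
above, the function and class names are made up for illustration, and I am
assuming the timeout would be in minutes. The point is only that in both cases
the node assignment gets released.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical values for the proposed stagefail_policy attribute.
HOLD = "hold"
DISCARD = "discard"


@dataclass
class StageFailAction:
    """What the server would do with a job whose stagein failed (illustrative)."""
    state: str                        # "H" for held, "DELETED" is just a label here
    release_nodes: bool               # whether the node assignment is given back
    release_after_min: Optional[int]  # assumed unit: minutes; None = no timeout


def on_stagein_failure(policy: str,
                       holdtimeout: Optional[int] = None) -> StageFailAction:
    """Map the proposed settings to an action for a failed stagein job."""
    if policy == DISCARD:
        # Drop the job; nothing is left to hold, so no timeout applies.
        return StageFailAction("DELETED", True, None)
    if policy == HOLD:
        # Really hold the job ("H", not "W") and free its nodes. With a
        # timeout the hold lasts that long, otherwise it is indefinite.
        return StageFailAction("H", True, holdtimeout)
    raise ValueError(f"unknown stagefail_policy: {policy!r}")
```

With the example values above, on_stagein_failure("hold", 90) would describe a
job that is held with its nodes released and a 90 minute hold timeout.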
