Bugzilla – Bug 51
Fix for r3466 commit
Last modified: 2010-03-29 15:43:24 MDT
You need to log in before you can comment on or make changes to this bug.
r3466 commit completely breaks the server (on default setup), here is a quick bug fix. The commit itself should be probably fixed, but that would would require a lot of work. Specifically when taking into account foreign jobs (have to be handled specifically), job moving (custom job id has to be translated into one, that is recognized globally), etc.... Here is a quick patch for now.
Created an attachment (id=23) [details] bug fix
When you say completely breaks the server on default setup, what do you mean? I am able to checkout a clean branch, run the setup, make a nodes file, and run jobs. I appreciate the optimizations in your patch, and I am planning on checking in a version of it today (I'll post the patch here and explain any differences) As far as the limitations of this feature, we are aware of them. The idea was to create this feature and document its restrictions instead of making it completely robust. This is the idea because we only know of one user that wants to use it. If it turns out there are others that want this feature and it becomes more widespread, we can definitely make it more robust and capable.
Simon, There's one consideration I want to talk to you about before commiting a patch: Your patch changes things so that the get_correct_jobname function isn't called if the settings haven't been changed. My only concern is that if a user sets this, runs a job, and then unsets it, that job may hang in a running state forever because the server won't always recognize the job's name in the obit. For me, this is potentially acceptable, it just needs to be considered. We can instruct people that these settings cannot be changed while jobs are queued or running. I don't think this would inconvenience anyone much as these settings shouldn't need to be changed very often, and since most users (I'm guessing) aren't going to use this feature, and this would affect people that aren't using it less. At the same time, this isn't going to add much time to job lookup, since job lookup has to scroll through all of the jobs known to that server, and this just adds a function call and some checking. I'm open to both possibilities. Please let me know your thoughts.
> When you say completely breaks the server on default setup, what do you mean? I > am able to checkout a clean branch, run the setup, make a nodes file, and run > jobs. Well, I can't run any jobs. By default (nothing is set) the Job ID stored with the job is the full hostname one: jobid.servername.hostname but when nothing is set the get_correct_jobname returns jobid.servername (string comparison then simply fails). It may work if the machine you are testing it on does not have a FQDN. > At the same time, this isn't going to add much time to job lookup, since job > lookup has to scroll through all of the jobs known to that server, and this > just adds a function call and some checking. I'm open to both possibilities. > Please let me know your thoughts. Torque is not CPU bound, the performance limitations are elsewhere. Well, I have a big problem with this feature, because apart from the fact, that I can't run any jobs on the server by default it isn't compatible with many other features I implemented into Torque in the last year. If the feature would be part of Trunk, or if it would be optional (configure flag), then I would be much more happy.
putting this new feature in 2.4-fixes makes me nervous. It feels like it still has some kinks to be worked out. I would feel a lot better if it went into trunk, and it could be backported to 2.4 once it is super solid (if the community does not have any reservations). I know that this was due to a request for a specific user. They can use a patched 2.4 or a 2.5 snapshot to get the feature, but I think we should wait to push this out to everyone using a production release. At the very least it should be #ifdef'd out by default.
Created an attachment (id=24) [details] Patch to correct these issues This patch will make it so that the function to find the correct jobname is never called if the variables are not set. The difference between this fix and the original is that it accounts for both parameters, not just display_job_server_suffix. In the future, we will be more strict about what things can be placed in fix branches.
I believe this issue is now taken care of.