Wickliffe, Blake W blake.wickliffe at aramco.com
Sat May 17 00:29:57 MDT 2008


We are seeing a strange issue with Maui interacting with Torque. It happens when we use qalter to change the resource requirements of a job.

The scenario is like this: we have a very large, heterogeneous pool of CPUs that we can run jobs on. We keep the older, slower, less reliable processors in reserve for when we get a large backlog of jobs. For example, a user submits a job and asks for 100 "fast" processors. Running checkjob on his queued job, you'd see something like:

Opsys: [NONE]  Arch: [NONE]  Features: [fast]

If the administrator sees a large backlog, he can opt (at the user's request) to manually move some jobs to the "slow" nodes. He does this in the normal way, with a qalter command. However, checkjob then shows something like this:

Opsys: [NONE]  Arch: [NONE]  Features: [fast][slow]

So somehow Maui is retaining the job's old resource requirements and adding the new ones to them. If you cycle (restart) Maui, you get:

Opsys: [NONE]  Arch: [NONE]  Features: [slow]

and everything works as one would expect. We'd like to avoid having to cycle Maui every time we need to do something like this. Is anyone else seeing this issue?
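To make the symptom concrete, here is a toy model contrasting what a qalter of the feature request should produce (the new request replaces the old one) with what checkjob shows before Maui is cycled (the old feature is kept and the new one appended). The function names are hypothetical and this is only an illustration of the behaviour described above, not Maui or Torque code.

```python
# Hypothetical model of the feature-list symptom; not Maui or Torque code.

def expected_features(old: list[str], new: list[str]) -> list[str]:
    """Expected result of a qalter: the new feature request replaces the old one."""
    return list(new)


def observed_features(old: list[str], new: list[str]) -> list[str]:
    """Observed result before cycling Maui: old features kept, new ones appended."""
    merged = list(old)
    for feature in new:
        if feature not in merged:
            merged.append(feature)
    return merged


def checkjob_line(features: list[str]) -> str:
    """Format a feature list the way the checkjob output quoted above shows it."""
    return "Opsys: [NONE]  Arch: [NONE]  Features: " + "".join(f"[{f}]" for f in features)


if __name__ == "__main__":
    before = ["fast"]     # original request
    requested = ["slow"]  # request after the administrator's qalter
    print(checkjob_line(observed_features(before, requested)))  # what we see now
    print(checkjob_line(expected_features(before, requested)))  # what we expect
```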

Vital stats:

Maui version: 3.2.6p19
Torque version: 2.3.0


Blake Wickliffe
Saudi Aramco

