[Mauiusers] Standing reservations and MOM restarts - Bug?
ake.sandgren at hpc2n.umu.se
Thu Mar 29 00:15:10 MDT 2007
On Wed, 2007-03-28 at 15:19 -0700, Jay Srinivasan wrote:
> Garrick Staples wrote:
> > On Wed, Mar 28, 2007 at 12:16:16AM -0700, Jay Srinivasan alleged:
> >> Hi,
> >> In moab/MRes.c in the MNodeUpdateResExpression() routine (around line
> >> 4075 in Maui-3.2.6p19), the check for MaxTasks and TaskCount, which is
> >> if ((R->MaxTasks > 0) && (R->TaskCount >= R->MaxTasks)) continue;
> >> I think, will check to see if the task count for the SR is more than the
> >> SRMAXTASKS parameter and then continue to the next SR and not update the
> >> current SR with the node(s) in the RegExp under consideration.
> >> But, in Maui at least, it does not seem that the SRMAXTASKS parameter is
> >> even honored (nor do setres or MResCreate() even take it as a
> >> parameter), and so it seems that MaxTasks is always zero in this case
> >> for SRs.
> >> Thus, every time a pbs_mom is recycled, this routine ends up adding the
> >> node that just came up to the SR nodelist, whether the node was on the
> >> list originally or not. This results in the SR gradually growing in size.
> >> I think the fix for this is to simply check for a possible MaxTasks
> >> value of 0 as well, i.e.
> >> if ((R->MaxTasks >= 0) && (R->TaskCount >= R->MaxTasks)) continue;
> >> Could someone who has a better knowledge of Maui internals please
> >> confirm that this is the case or let me know if I am not correct?
> > I can't comment directly on the problem, but I can say that Maui doesn't
> > talk to pbs_mom and I can't think of any reason why restarting pbs_mom
> > could affect Maui.
> Yes, perhaps not directly. But Maui has to know how many MOMs are
> running and coordinate the node->SR mapping. So, when Maui does its
> periodic scan and figures out that a node which was down has become
> available again (either through Torque or PBSPro -- I have the problem
> under both), it goes through the MNodeUpdateResExpression() code path
> and tosses that node onto the SR nodelist always (whether or not the
> node was on the SR nodelist to begin with).
Yes, I agree, we have seen this behaviour too. That's one reason we
stopped using SRs.
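
To make the difference between the two conditions quoted above concrete,
here is a minimal sketch that evaluates them in isolation. The struct and
the two function names are hypothetical stand-ins, not Maui code; only the
field names MaxTasks and TaskCount and the two boolean expressions come
from the thread.

```c
#include <stdbool.h>

/* Hypothetical, pared-down stand-in for the reservation fields named in
 * the thread. This is NOT the real Maui reservation structure. */
struct sres_model {
    int MaxTasks;   /* 0 is assumed to mean "never set", as described above */
    int TaskCount;  /* tasks currently held by the standing reservation */
};

/* Condition as quoted from MNodeUpdateResExpression(): the SR is skipped
 * only when a positive cap exists and has been reached. */
bool skips_original(const struct sres_model *R)
{
    return (R->MaxTasks > 0) && (R->TaskCount >= R->MaxTasks);
}

/* Condition as proposed in the thread: a MaxTasks of 0 now also passes
 * the first test, so the second test decides. */
bool skips_proposed(const struct sres_model *R)
{
    return (R->MaxTasks >= 0) && (R->TaskCount >= R->MaxTasks);
}
```

With MaxTasks at 0 and any non-negative TaskCount, skips_original()
returns false (the node gets considered for the SR) while skips_proposed()
returns true (the SR is skipped). So in this model the proposed check stops
the growth, but it also means an SR with MaxTasks 0 is never updated through
this path. Whether that is acceptable depends on parts of Maui not shown
here.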
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ake at hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se