[torqueusers] Re: Strange Torque Bug (with fix) - mom loses track of jobs it starts when run with -p option on RHEL3

jacksond at supercluster.org
Thu Sep 16 09:27:48 MDT 2004


   Thanks for finding and correcting this.  The Torque 1.1.0 pre patch 2 
snapshot contains this change and the CHANGELOG has been updated.  Please 
let us know if you find anything further.


On Thu, 16 Sep 2004, Chris Samuel wrote:

> Hi folks,
> This is the weirdest thing: we're helping a bunch of folks set up a cluster
> based on Red Hat Enterprise Linux 3 and we've been having a heap of problems
> with the pbs_mom losing track of the processes it started when we use the -p
> option to tell it to inherit already running jobs.  This is not just a
> problem for jobs that it's inherited, it also affects jobs that it itself has
> just started (they usually get lost within a couple of minutes, if the
> machine becomes CPU bound).
> We don't seem to see this problem with the (slightly earlier) version we run
> on our Xeon RH7.3 or Opteron Fedora Core 2 based clusters here at VPAC.
> The symptom is a message in the mom logs saying:
> 09/15/2004 18:26:36;0008;   pbs_mom;Job;scan_non_child_tasks;found exited
> session 3340 for task 1 in job 92007.XXX
> Inspecting that code path at VPAC and at the problem cluster shows that the
> code for this section is pretty much identical.
> The problem is in this code fragment in scan_non_child_tasks() in
> src/resmom/linux/mom_mach.c :
>      rewinddir(pdir);
>      while ((dent = readdir(pdir)) != NULL)
> My debugging showed that occasionally the readdir() would start at the process
> *after* the one being looked at, i.e. the rewinddir() was apparently not
> working.
> Writing a little test program that only opendir()'d /proc and then looped
> (with a small sleep()) doing a rewinddir() and readdir() at the same time as
> the pbs_mom was running with my debug log messages showed that the /proc
> directory looked fine when this happened.
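
A test program along these lines might look like the sketch below. It is not the program Chris ran, which was not posted; the function name count_bad_rewinds() and its parameters are made up for illustration. It keeps one directory stream open, rewinds it before each pass, and counts the passes whose first entry differs from the one seen on the first pass.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of Torque.
 * Opens `path` once, then performs `passes` rounds of rewinddir() followed
 * by a full readdir() scan, sleeping `delay_secs` seconds between rounds.
 * Returns the number of rounds whose first entry differed from the first
 * entry of round one, or -1 if the directory could not be opened or read.
 */
int count_bad_rewinds(const char *path, int passes, unsigned int delay_secs)
{
    DIR *dir = opendir(path);
    struct dirent *dent;
    char first[256] = "";   /* assumes names of at most 255 bytes (Linux NAME_MAX) */
    int mismatches = 0;
    int pass;

    if (dir == NULL)
        return -1;

    for (pass = 0; pass < passes; pass++) {
        rewinddir(dir);
        dent = readdir(dir);
        if (dent == NULL) {
            closedir(dir);
            return -1;
        }

        if (pass == 0) {
            /* Remember where a scan from the top is expected to start. */
            snprintf(first, sizeof first, "%s", dent->d_name);
        } else if (strcmp(first, dent->d_name) != 0) {
            fprintf(stderr, "pass %d started at %s, expected %s\n",
                    pass, dent->d_name, first);
            mismatches++;
        }

        /* Read to the end so the next rewinddir() has real work to do. */
        while (readdir(dir) != NULL)
            ;

        if (delay_secs > 0)
            sleep(delay_secs);
    }

    closedir(dir);
    return mismatches;
}
```

Calling count_bad_rewinds("/proc", 600, 1) while pbs_mom is running would log any pass that did not start at the top of the directory. A return value of 0 would match what Chris reports: /proc itself behaved when read through a stream that nothing else touched.
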
> Looking further at the code showed that pdir is a global variable in the file
> of type DIR * and is used in a number of other functions (all with a
> rewinddir() in front of them).
> In the end I took the step of changing the scan_non_child_tasks() to use a
> local variable scoped to the function with an opendir() at the start and a
> closedir() just before the return() and that seems to have fixed it.
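
The shape of that change can be sketched as follows. This is an illustration, not the actual patch (the attached diff is not reproduced in this archive). The helper name pid_is_listed() and its arguments are hypothetical, and the real scan_non_child_tasks() does considerably more than look up one numeric entry. The point is only that the stream is opened, scanned, and closed inside the function, so no other code can move its position.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of Torque.
 * Scans `procdir` with a stream that is local to this call.
 * Returns 1 if an entry named exactly `pid` (in decimal) exists,
 * 0 if it does not, and -1 if the directory cannot be opened.
 */
int pid_is_listed(const char *procdir, long pid)
{
    DIR *dir = opendir(procdir);   /* private stream, starts at the top */
    struct dirent *dent;
    int found = 0;

    if (dir == NULL)
        return -1;

    while ((dent = readdir(dir)) != NULL) {
        char *end;
        long val;

        /* Only purely numeric names are process entries in /proc. */
        if (!isdigit((unsigned char)dent->d_name[0]))
            continue;

        val = strtol(dent->d_name, &end, 10);
        if (*end == '\0' && val == pid) {
            found = 1;
            break;
        }
    }

    closedir(dir);                 /* released before every return */
    return found;
}
```

The cost of this approach is an opendir()/closedir() pair on every scan instead of a rewinddir() on a long-lived stream, which is a reasonable trade for a scan that can no longer be disturbed from elsewhere.
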
> My speculation is that this possibly has something to do with the fact that
> RHEL3 has NPTL backported from 2.6 to its 2.4 kernel and glibc libraries, and
> that somehow (prompted by a connection from the pbs_server I think) a race
> condition exists where pdir gets changed by another function in between the
> rewinddir() and the readdir().
> I've attached a diff between the (hopefully) fixed function and the old one.
> Of course, with a number of other functions using the same pdir global
> variable I guess it's possible for this race condition to reoccur somewhere
> else in the mom too!
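
Why a shared DIR * is fragile can be shown without any real concurrency. The sketch below is a simulation, not a reproduction of the RHEL3 behaviour: it calls a second scan directly from inside the first one, whereas how such interleaving would arise inside pbs_mom is not established in this thread. The names shared_dir, other_scan() and scan_shared() are made up for illustration, with shared_dir standing in for the global pdir.

```c
#include <dirent.h>
#include <stddef.h>

/* Stand-in for the file-scope pdir: one stream shared by every scanner. */
static DIR *shared_dir;

/* Stand-in for any other routine that rewinds and reads the shared stream. */
static void other_scan(void)
{
    rewinddir(shared_dir);
    while (readdir(shared_dir) != NULL)
        ;
}

/*
 * Hypothetical helper, not part of Torque.
 * Scans `path` through the shared stream and returns how many entries the
 * scan saw, or -1 if the directory cannot be opened. If `interrupt_after`
 * is positive, other_scan() runs once after that many entries, simulating
 * another code path touching the stream in the middle of this scan.
 */
int scan_shared(const char *path, int interrupt_after)
{
    int seen = 0;

    shared_dir = opendir(path);
    if (shared_dir == NULL)
        return -1;

    rewinddir(shared_dir);
    while (readdir(shared_dir) != NULL) {
        seen++;
        if (seen == interrupt_after)
            other_scan();   /* moves the position this loop depends on */
    }

    closedir(shared_dir);
    shared_dir = NULL;
    return seen;
}
```

With interrupt_after set to 0 the scan sees every entry in the directory. With interrupt_after set to 1 it returns after a single entry, because the nested scan left the shared position at the end of the directory. Any of the other functions that use pdir could be affected in the same way, which is the concern Chris raises above.
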
> cheers,
> Chris
> --
> Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
> Victorian Partnership for Advanced Computing http://www.vpac.org/
> Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
