[torqueusers] Possible improvement for Linux mom
jacksond at clusterresources.com
Thu Apr 21 12:02:19 MDT 2005
The latest TORQUE 1.2.0p3 snapshot contains the requested change, and
at this point it is not #ifdef'd. The kill(sessionid,0) approach is tried
first and, if this fails, the /proc search is conducted. Note that only the
scan_non_child_tasks() routine has been modified at this point.
Please let us know what you find.
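
For anyone who wants to experiment with the idea outside of the mom, here is
a minimal sketch of the "kill 0 first, /proc search as fallback" logic. It is
not the code from the snapshot: the function names session_alive(),
session_in_proc() and proc_session_of() are made up for illustration, and it
assumes the Linux /proc/<pid>/stat layout in which the session ID is the
fourth field after the parenthesised command name.

```c
#include <ctype.h>
#include <dirent.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: read the session ID of `pid` from /proc/<pid>/stat.
 * Returns -1 if the entry cannot be read or parsed. */
static pid_t proc_session_of(pid_t pid)
{
    char path[64], buf[1024];
    char state;
    char *p;
    int ppid, pgrp, sid;
    FILE *fp;

    snprintf(path, sizeof(path), "/proc/%ld/stat", (long)pid);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    /* The command name is parenthesised and may contain spaces, so
     * parse from the last ')' onwards: state ppid pgrp session ... */
    p = strrchr(buf, ')');
    if (p == NULL ||
        sscanf(p + 1, " %c %d %d %d", &state, &ppid, &pgrp, &sid) != 4)
        return -1;
    return (pid_t)sid;
}

/* Fallback: walk /proc looking for any process that belongs to session
 * `sid`. A fresh opendir() is used for every scan instead of rewinddir(). */
static int session_in_proc(pid_t sid)
{
    DIR *dir = opendir("/proc");
    struct dirent *ent;
    int found = 0;

    if (dir == NULL)
        return 0;
    while ((ent = readdir(dir)) != NULL) {
        if (!isdigit((unsigned char)ent->d_name[0]))
            continue;               /* skip non-process entries */
        if (proc_session_of((pid_t)atol(ent->d_name)) == sid) {
            found = 1;
            break;
        }
    }
    closedir(dir);
    return found;
}

/* Returns 1 if the session appears to be alive, 0 otherwise. */
int session_alive(pid_t sid)
{
    if (sid <= 0)
        return 0;   /* kill() gives pid <= 0 a process-group meaning */

    /* Signal 0 performs only the existence and permission checks.
     * EPERM means the process exists but we may not signal it. */
    if (kill(sid, 0) == 0 || errno == EPERM)
        return 1;

    /* The session leader is gone; other members may still be running. */
    return session_in_proc(sid);
}
```

Calling session_alive(sid) costs a single system call in the common case
where the session leader is still running; the /proc walk only happens when
kill() reports that the leader no longer exists.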
On Thu, 2005-04-07 at 16:46 +1000, Chris Samuel wrote:
> Hi folks,
> We've got a bizarre problem on a RHEL3 cluster running Torque that I can't get
> to the bottom of.
> It's a re-emergence of the old problem where when you run a mom with the -p
> option it occasionally loses track of running jobs and declares them dead
> even though the process is still there.
> This happens because for some unknown reason when the mom starts scanning
> through the /proc directory after doing a rewinddir() it does not start at
> the beginning of the directory structure, but partway down.
> Of course if where it starts is *after* the process it is looking for it fails
> to find it and declares it to be dead at the end.
> Looking at the code, all it appears to do is traipse through the entries
> in /proc using get_proc_stat() trying to find processes that belong to a
> particular session.
> However, my feeling is that as the session ID *should* be the user's PBS script
> that is running, it would be better to just check whether the process ID
> for the session is still alive, no?
> This would reduce the code to just having to do a kill(sessionid,0); to see if
> it's still there, meaning you only need to do a single system call per session
> rather than traversing /proc per session you're trying to find.
> How far off the track am I here ?