[torqueusers] Possible improvement for Linux mom
jacksond at clusterresources.com
Sat Apr 9 19:06:43 MDT 2005
I like the change. We use this same approach in several system tools
and it has worked on all platforms we have tested. If there are no
objections from the community, we will add the change behind an #ifdef
in the next release and, if all works well from there, we will make it
the default in follow-on releases.
On Thu, 2005-04-07 at 16:46 +1000, Chris Samuel wrote:
> Hi folks,
> We've got a bizarre problem on a RHEL3 cluster running Torque that I can't get
> to the bottom of.
> It's a re-emergence of the old problem where, when you run a mom with the -p
> option, it occasionally loses track of running jobs and declares them dead
> even though the process is still there.
> This happens because for some unknown reason when the mom starts scanning
> through the /proc directory after doing a rewinddir() it does not start at
> the beginning of the directory structure, but partway down.
> Of course, if it starts *after* the process it is looking for, it fails
> to find it and declares it to be dead at the end of the scan.
> Looking at the code, all it appears to do is traipse through the entries
> in /proc using get_proc_stat() trying to find processes that belong to a
> particular session.
> However, my feeling is that as the session ID *should* be the PID of the user's
> PBS script that is running, it would be better to just check whether the
> process ID for the session is still alive, no?
> This would reduce the code to just having to do a kill(sessionid,0); to see if
> it's still there, meaning you only need to do a single system call per session
> rather than traversing /proc for each session you're trying to find.
> How far off the track am I here?
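
For reference, the kill(sessionid,0) check proposed above could look
roughly like the sketch below. This is only an illustration, not the
Torque code: the helper name session_leader_alive() is hypothetical,
and it assumes the session ID really is the PID of the job's top-level
process. Signal 0 delivers nothing; kill() only performs the existence
and permission checks.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not part of Torque).
 * Returns 1 if a process with PID == sid exists,
 *         0 if no such process exists,
 *        -1 on invalid input or an unexpected error.
 */
int session_leader_alive(pid_t sid)
{
    /* kill() treats 0 and negative values as process groups, so reject them. */
    if (sid <= 0)
        return -1;

    /* Signal 0: no signal is sent, only error checking is performed. */
    if (kill(sid, 0) == 0)
        return 1;

    /* EPERM means the process exists but we may not signal it. */
    if (errno == EPERM)
        return 1;

    /* ESRCH means no process with that PID exists. */
    if (errno == ESRCH)
        return 0;

    return -1;
}
```

Some caveats with this approach: a zombie that has not been reaped yet
still counts as existing, the PID could in principle have been reused
by an unrelated process, and the session leader may exit while other
processes in the session keep running. Whether any of these matter for
the mom is something that would need testing.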