[torqueusers] Possible improvement for Linux mom scan_non_child_tasks() ?

David Jackson jacksond at clusterresources.com
Thu Apr 21 12:02:19 MDT 2005


  The latest TORQUE 1.2.0p3 snapshot contains the requested change and,
at this point, it is not #ifdef'd.  The kill 0 approach will be used and,
if this fails, the /proc search will be conducted.  Note that only the
scan_non_child_tasks() routine has been modified at this point.
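
  For anyone following along, the kill-then-/proc order described above
can be sketched roughly as follows.  This is only an illustration and is
not the code in the snapshot: pid_alive(), session_in_proc() and
session_alive() are made-up names, and the /proc/<pid>/stat parsing
assumes the usual Linux layout, where the session ID is the sixth field.

```c
#include <ctype.h>
#include <dirent.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: 1 if a process with this pid exists, else 0.
 * kill(pid, 0) delivers no signal and only performs the existence and
 * permission checks; EPERM means the process exists but is not ours. */
static int pid_alive(pid_t pid)
{
    if (pid <= 0)
        return 0;
    if (kill(pid, 0) == 0)
        return 1;
    return errno == EPERM;
}

/* Hypothetical fallback: 1 if any entry in /proc belongs to session sid. */
static int session_in_proc(pid_t sid)
{
    DIR *dir = opendir("/proc");
    struct dirent *ent;
    int found = 0;

    if (dir == NULL)
        return 0;

    while (!found && (ent = readdir(dir)) != NULL) {
        char path[300];
        char buf[512];
        char *p;
        int session;
        FILE *fp;

        /* Only numeric directory names are processes. */
        if (!isdigit((unsigned char)ent->d_name[0]))
            continue;

        snprintf(path, sizeof(path), "/proc/%s/stat", ent->d_name);
        fp = fopen(path, "r");
        if (fp == NULL)
            continue;               /* process exited while we were scanning */

        if (fgets(buf, sizeof(buf), fp) != NULL) {
            /* The command name is in parentheses and may contain spaces,
             * so parse from the last ')': state ppid pgrp session ... */
            p = strrchr(buf, ')');
            if (p != NULL &&
                sscanf(p + 1, " %*c %*d %*d %d", &session) == 1 &&
                session == (int)sid)
                found = 1;
        }
        fclose(fp);
    }
    closedir(dir);
    return found;
}

/* Cheap check first (one system call), full /proc scan only if it fails. */
int session_alive(pid_t sid)
{
    return pid_alive(sid) || session_in_proc(sid);
}
```

  The fallback matters when the session leader itself has exited but
other processes in the same session are still running; kill(sid, 0)
alone would report such a session as gone.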

  Please let us know what you find.


On Thu, 2005-04-07 at 16:46 +1000, Chris Samuel wrote:
> Hi folks,
> We've got a bizarre problem on a RHEL3 cluster running Torque that I can't get 
> to the bottom of.
> It's a re-emergence of the old problem where when you run a mom with the -p 
> option it occasionally loses track of running jobs and declares them dead 
> even though the process is still there.
> This happens because for some unknown reason when the mom starts scanning 
> through the /proc directory after doing a rewinddir() it does not start at 
> the beginning of the directory structure, but partway down.
> Of course if where it starts is *after* the process it is looking for it fails 
> to find it and declares it to be dead at the end.
> Looking at the code all it appears to do is traipse through the entries 
> in /proc using get_proc_stat() trying to find processes that belong to a 
> particular session.
> However, my feeling is that as the session ID *should* be the user's PBS script 
> that is running then it would be better to just check whether the process ID 
> for the session is still alive, no ?
> This would reduce the code to just having to do a kill(sessionid,0); to see if 
> it's still there, meaning you only need to do a single system call per session 
> rather than traversing /proc per session you're trying to find.
> How far off the track am I here ?
> cheers!
> Chris