[torqueusers] Possible improvement for Linux mom scan_non_child_tasks() ?
Chris Samuel
csamuel at vpac.org
Thu Apr 7 00:46:00 MDT 2005
Hi folks,
We've got a bizarre problem on a RHEL3 cluster running Torque that I can't get
to the bottom of.
It's a re-emergence of the old problem where when you run a mom with the -p
option it occasionally loses track of running jobs and declares them dead
even though the process is still there.
This happens because, for some unknown reason, when the mom starts scanning
through the /proc directory after doing a rewinddir() it does not start at
the beginning of the directory, but partway down.
Of course, if it starts *after* the process it is looking for, it fails to
find it and declares it dead at the end of the scan.
Looking at the code, all it appears to do is traipse through the entries
in /proc using get_proc_stat(), trying to find processes that belong to a
particular session.
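
For readers who have not looked at that kind of scan, here is a minimal
sketch of the general technique, not the actual mom code. The helper names
proc_session() and count_session_members() are made up for illustration, and
the sketch reads /proc/<pid>/stat directly instead of calling
get_proc_stat(). It also opens /proc afresh on every scan rather than reusing
a handle with rewinddir(); that is a choice made for this sketch, not a
statement about how the mom does it.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: return the session ID recorded in /proc/<pid>/stat,
 * or -1 if the entry cannot be read or parsed. */
static long proc_session(const char *pid_name)
{
    char path[288], buf[1024];
    char state, *p;
    long ppid, pgrp, session;
    FILE *fp;

    snprintf(path, sizeof path, "/proc/%s/stat", pid_name);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;                  /* process may have exited already */
    if (fgets(buf, sizeof buf, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    /* The command name is wrapped in parentheses and may itself contain
     * spaces or ')', so resume parsing after the last ')'. */
    p = strrchr(buf, ')');
    if (p == NULL)
        return -1;
    if (sscanf(p + 1, " %c %ld %ld %ld", &state, &ppid, &pgrp, &session) != 4)
        return -1;
    return session;
}

/* Hypothetical helper: count processes whose session ID equals sid.
 * Returns -1 if /proc cannot be opened. */
static int count_session_members(long sid)
{
    DIR *dir = opendir("/proc");    /* fresh handle, starts at the top */
    struct dirent *ent;
    int count = 0;

    if (dir == NULL)
        return -1;
    while ((ent = readdir(dir)) != NULL) {
        if (!isdigit((unsigned char)ent->d_name[0]))
            continue;               /* skip non-process entries */
        if (proc_session(ent->d_name) == sid)
            count++;
    }
    closedir(dir);
    return count;
}
```

Note that this costs one directory walk plus one file read per process for
every session being checked, which is the overhead the suggestion below is
trying to avoid.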
However, my feeling is that as the session ID *should* be the PID of the
user's PBS script that is running, it would be better to just check whether
the process ID for the session is still alive, no ?
This would reduce the code to just having to do a kill(sessionid,0); to see if
it's still there, meaning you only need a single system call per session
rather than a traversal of /proc for each session you're trying to find.
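
A minimal sketch of that check, assuming the session ID really is the PID of
the session leader. The function name session_leader_alive() is hypothetical
and does not come from the Torque source.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Hypothetical helper: return 1 if a process with PID sid exists, else 0.
 * Signal 0 delivers nothing; kill() only performs its existence and
 * permission checks. */
static int session_leader_alive(pid_t sid)
{
    if (sid <= 0)
        return 0;           /* 0 and negative values address process groups */
    if (kill(sid, 0) == 0)
        return 1;           /* process exists and we may signal it */
    return errno == EPERM;  /* exists but owned by another user; ESRCH: gone */
}
```

Two caveats with this sketch: it only tests the session leader, so it says
nothing about other members of the session that may outlive it, and kill()
also succeeds for a zombie that has not been reaped yet.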
How far off the track am I here ?
cheers!
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia