[torqueusers] Possible improvement for Linux mom scan_non_child_tasks() ?

Chris Samuel csamuel at vpac.org
Thu Apr 7 00:46:00 MDT 2005


Hi folks,

We've got a bizarre problem on a RHEL3 cluster running Torque that I can't get 
to the bottom of.

It's a re-emergence of the old problem where a mom run with the -p option 
occasionally loses track of running jobs and declares them dead even though 
the process is still there.

This happens because for some unknown reason when the mom starts scanning 
through the /proc directory after doing a rewinddir() it does not start at 
the beginning of the directory structure, but partway down.

Of course, if it starts *after* the process it is looking for, it fails to 
find it and declares it dead at the end of the scan.

Looking at the code, all it appears to do is traipse through the entries 
in /proc using get_proc_stat(), trying to find processes that belong to a 
particular session.
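
For anyone who hasn't read that code, the kind of scan described above looks 
roughly like the sketch below. This is not the Torque source: 
proc_session_id() and count_session_members() are hypothetical names, and 
the sketch opens a fresh directory stream on every pass rather than reusing 
one with rewinddir() as the mom does. It assumes the Linux /proc/<pid>/stat 
layout, where the session ID is the fourth field after the closing 
parenthesis of the command name.

```c
#define _POSIX_C_SOURCE 200809L
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from Torque): read the session ID recorded in
 * /proc/<name>/stat. Returns -1 if the file is missing or unparsable. */
static long proc_session_id(const char *name)
{
    char path[64], buf[1024], state;
    long ppid, pgrp, sid;
    size_t n;
    char *p;
    FILE *fp;

    snprintf(path, sizeof path, "/proc/%s/stat", name);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    n = fread(buf, 1, sizeof buf - 1, fp);
    fclose(fp);
    buf[n] = '\0';

    /* The command name may contain spaces or ')' characters, so start
     * parsing after the last ')' in the line. */
    p = strrchr(buf, ')');
    if (p == NULL)
        return -1;
    if (sscanf(p + 1, " %c %ld %ld %ld", &state, &ppid, &pgrp, &sid) != 4)
        return -1;
    return sid;
}

/* Hypothetical helper: count the processes in /proc whose session ID is
 * sid. Returns -1 if /proc cannot be opened. */
static int count_session_members(long sid)
{
    DIR *dir = opendir("/proc");
    struct dirent *ent;
    int count = 0;

    if (dir == NULL)
        return -1;
    while ((ent = readdir(dir)) != NULL) {
        /* Only the numeric entries are processes. */
        if (!isdigit((unsigned char)ent->d_name[0]))
            continue;
        if (proc_session_id(ent->d_name) == sid)
            count++;
    }
    closedir(dir);
    return count;
}
```

Every pass costs one open and read per process on the node, and it has to 
be repeated for each session being tracked.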

However, my feeling is that as the session ID *should* be the process ID of 
the user's PBS script that is running, it would be better to just check 
whether that process is still alive, no?

This would reduce the code to just having to do a kill(sessionid,0); to see if 
it's still there, meaning you only need to do a single system call per session 
rather than traversing /proc for each session you're trying to find.
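
A minimal sketch of what I mean, assuming the session ID really is the PID 
of the session leader. session_leader_alive() is a hypothetical name, not an 
existing Torque function. Signal 0 delivers nothing; kill() only performs 
its existence and permission checks.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Hypothetical helper (not from Torque).
 * Returns 1 if a process with ID sid exists, 0 if it does not,
 * and -1 for an invalid ID or an unexpected error. */
static int session_leader_alive(pid_t sid)
{
    /* kill() treats 0 and negative values as process groups or
     * "everything", so refuse them here. */
    if (sid <= 0)
        return -1;

    if (kill(sid, 0) == 0)
        return 1;           /* exists and we are allowed to signal it */
    if (errno == EPERM)
        return 1;           /* exists, but owned by someone else */
    if (errno == ESRCH)
        return 0;           /* no such process */
    return -1;
}
```

Note that this only tells you about the session leader itself. It would 
report the session as gone if the leader had exited while other processes in 
the session were still running, and it could be fooled if the PID had been 
reused by an unrelated process.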

How far off the track am I here ?

cheers!
Chris
-- 
 Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
