[torqueusers] Release of TORQUE 2.3.3

Josh Butikofer josh at clusterresources.com
Thu Aug 21 11:10:29 MDT 2008


Joshua Bernstein wrote:
> 
> 
> Josh Butikofer wrote:
>> Good question.
>>
>> There is a subtle difference. The command pbsdsh uses tm_spawn() to 
>> launch the new process. The new pbs_track command uses a new TM 
>> function called tm_adopt(). This function allows a MOM to start 
>> watching a process started by something other than the MOM. Any 
>> process in the system can be added to the MOM's session table and it 
>> will then send signals to that session, track resource usage, etc. as 
>> if the job was "born" from the pbs_mom in the normal tm_spawn() way.
> 
> That's very slick. Though I imagine the pbs_mom is only able to track the 
> resource usage AFTER tm_adopt() is called. So, I assume a proper 
> implementation would call tm_adopt() right after a fork()/exec(), though 
> since the fork()/spawn()/adopt() sequence isn't atomic the usage isn't 
> perfect, but it's very likely close enough.

Yeah, of course tm_adopt() can only be called once the process you want to track exists. You would 
usually do a fork(), then call tm_adopt(), then exec(). By default, however, the new pbs_track 
command does a tm_adopt() and then an exec(). This is because with SGI MPT, the array services must 
start the process that eventually becomes the job.
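
For illustration, the fork(), tm_adopt(), exec() ordering can be sketched roughly as below. This is 
only a sketch in Python, not TORQUE code: adopt_stub() and run_tracked() are hypothetical names, and 
adopt_stub() is a stand-in for the real tm_adopt() call, whose C signature is not shown here. The 
point is that the child registers its own PID before exec(), and exec() keeps that PID.

```python
import os
import sys


def adopt_stub(job_id, pid):
    """Hypothetical stand-in for tm_adopt(); it only logs the request."""
    sys.stderr.write("adopt pid %d into job %s\n" % (pid, job_id))
    return 0  # 0 means success in this sketch


def run_tracked(job_id, argv, adopt=adopt_stub):
    """fork(), adopt the child's own PID, then exec() the real program."""
    pid = os.fork()
    if pid == 0:
        # Child: register first, then exec(). exec() keeps the PID, so the
        # adopted process and the program that runs are the same process.
        try:
            if adopt(job_id, os.getpid()) != 0:
                os._exit(126)  # adoption refused, do not start the program
            os.execvp(argv[0], argv)
        except Exception:
            pass  # exec() or the adopt callback failed
        os._exit(127)
    # Parent: wait for the child and report how it ended.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)
```

Something like run_tracked("1234.headnode", ["echo", "hello"]) would run echo as the adopted process 
and return its exit status; the job id there is a made-up value. Anything the child does between 
fork() and the adopt call still goes unaccounted, which is the small window mentioned above.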

>> This is useful for some MPI libraries....
> 
> How about this case. Perhaps there is a case where a process is already 
> running, and all of a sudden you want to include that process's usage in 
> the job's usage. Consider an application where there are daemons left 
> running on the compute nodes that perform some resource-intensive task on 
> behalf of the job. (This is common with some commercial life science 
> applications.) Then when a job starts, pbs_mom would fork the process, and 
> through, say, a prologue, tm_adopt() is called on the daemon's PID. This 
> way the daemon's CPU time is accurately accounted for in the job. This 
> leads to a few questions:
> 
> 1) Can a PID be adopted by more than one job at the same time?

Yeah ... this is theoretically possible. I haven't tested it though. :)

> 2) Is there something like tm_orphan() to un-adopt a PID?

No, there is not.

