[torqueusers] Wrong cput value

Kevin Murphy murphy at genome.chop.edu
Wed Jul 23 08:28:17 MDT 2008


Brock Palen wrote:
> It's not a bug; it happens consistently.  Some codes create processes 
> that are not children of the mom.  If a process is not a child of the 
> mom, PBS can't keep track of it.
> I think there is a different problem with something else that causes 
> PBS to lose track.
>
Yes, I can understand that.  My 'bug' comment was just a hypothesis 
based on the observation that the 563 jobs executed the exact same code 
(on different data), and only 3 of the jobs had invalid cput values.
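
For anyone wanting to screen a batch of jobs the same way, here is a 
minimal Python sketch.  It assumes each job's record is available as a 
line of text containing resources_used.cput and resources_used.walltime 
in HH:MM:SS form; the function names and the 5% cput/walltime threshold 
are hypothetical, picked only for illustration, and a low ratio is merely 
a hint that cput may be under-reported, not proof.

```python
import re

# Matches e.g. "resources_used.cput=00:00:02" or "resources_used.walltime=01:10:00".
_FIELD = re.compile(r"resources_used\.(cput|walltime)=(\d+):(\d{2}):(\d{2})")


def to_seconds(hours, minutes, seconds):
    """Convert HH:MM:SS components (as strings) to a number of seconds."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds)


def used_seconds(line):
    """Return a dict with 'cput' and/or 'walltime' in seconds, if present."""
    return {name: to_seconds(h, m, s) for name, h, m, s in _FIELD.findall(line)}


def suspicious(lines, min_ratio=0.05):
    """Return the lines whose cput/walltime ratio is below min_ratio.

    min_ratio is an arbitrary illustrative threshold (an assumption),
    not a value taken from Torque.
    """
    flagged = []
    for line in lines:
        used = used_seconds(line)
        if "cput" not in used or "walltime" not in used:
            continue  # record lacks one of the fields; nothing to compare
        if used["walltime"] == 0:
            continue  # avoid dividing by zero for jobs that ended instantly
        if used["cput"] / used["walltime"] < min_ratio:
            flagged.append(line)
    return flagged
```

Feeding it the relevant lines for all 563 jobs would return only the 
records whose cput looks implausibly small next to their walltime.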

-Kevin

> On Jul 23, 2008, at 10:03 AM, Kevin Murphy wrote:
>> Brock Palen wrote:
>>> Were these jobs different code?
>>> Some code (hfss comes to mind)
>>> forks the real process and somehow torque loses track of it.  So 
>>> cput will be almost zero.
>>> Another possibility: if you're using parallel code, the user may not 
>>> be using a tm-enabled mpirun.
>>>
>> The jobs use identical code, which happens to be a Perl wrapper 
>> around a command-line java program, invoked via system().  So you're 
>> suggesting that Torque might under rare circumstances (because of 
>> some bug?) fail to account for the CPU time of the child processes 
>> such as the perl-forked shell and shell-forked java process ....  
>> Hmmm.   So in general if a job invokes anything (?) which might fork, 
>> the cput value should be treated with suspicion.  Too bad.
>>>
>>> On Jul 22, 2008, at 2:39 PM, Kevin Murphy wrote:
>>>> I recently ran tracejob to compare runtime versus data-size 
>>>> statistics on 563 jobs, and three of them had impossibly low 
>>>> resources_used.cput values.


