[torqueusers] Wrong cput value
murphy at genome.chop.edu
Wed Jul 23 08:28:17 MDT 2008
Brock Palen wrote:
> It's not a bug; it happens consistently. Some codes create processes
> that are not children of the mom. If it's not, PBS can't keep track.
> I think there is a different problem with something else that causes
> PBS to lose track.
Yes, I can understand that. My 'bug' comment was just a hypothesis
based on the observation that the 563 jobs executed the exact same code
(on different data), and only 3 of the jobs had invalid cput values.
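
For anyone wanting to screen a batch of jobs for this, here is a minimal sketch in Python that flags records whose cput is implausibly low relative to walltime. It assumes each job's usage is available as a line of key=value pairs with resources_used.cput and resources_used.walltime in HH:MM:SS form, as tracejob and the accounting logs print them on my system. The function names and the 5% threshold are hypothetical choices for illustration, not anything Torque provides.

```python
import re

# Assumed input: a text record containing fields such as
#   resources_used.cput=00:00:02 resources_used.walltime=01:10:00
_FIELD = re.compile(r"resources_used\.(cput|walltime)=(\d+):(\d{2}):(\d{2})")


def used_seconds(record):
    """Return the cput/walltime fields found in record, converted to seconds."""
    out = {}
    for name, hours, minutes, seconds in _FIELD.findall(record):
        out[name] = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
    return out


def suspicious(record, min_ratio=0.05):
    """True if cput/walltime falls below min_ratio (a hypothetical threshold).

    Returns False when either field is missing or walltime is zero,
    since no judgment can be made in those cases.
    """
    used = used_seconds(record)
    if "cput" not in used or "walltime" not in used:
        return False
    if used["walltime"] == 0:
        return False
    return used["cput"] / used["walltime"] < min_ratio
```

A ratio test only makes sense for CPU-bound serial jobs like mine; an I/O-bound job can legitimately show low cput, so the threshold would need tuning per workload.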
> On Jul 23, 2008, at 10:03 AM, Kevin Murphy wrote:
>> Brock Palen wrote:
>>> Were these jobs different code?
>>> Some code (hfss comes to mind)
>>> forks the real process and somehow Torque loses track of it. So
>>> cput will be almost zero.
>>> Another possibility, if you're using parallel code: the user is not
>>> using a tm-enabled mpirun.
>> The jobs use identical code, which happens to be a Perl wrapper
>> around a command-line Java program, invoked via system(). So you're
>> suggesting that Torque might under rare circumstances (because of
>> some bug?) fail to account for the CPU time of the child processes
>> such as the perl-forked shell and shell-forked java process ....
>> Hmmm. So in general if a job invokes anything (?) which might fork,
>> the cput value should be treated with suspicion. Too bad.
>>> On Jul 22, 2008, at 2:39 PM, Kevin Murphy wrote:
>>>> I recently ran tracejob to compare runtime versus data-size
>>>> statistics on 563 jobs, and three of them had impossibly low
>>>> resources_used.cput values.