[torqueusers] torque 2.3.3 not reporting CPU time used
Josh Butikofer
josh at clusterresources.com
Wed Sep 3 14:55:40 MDT 2008
Andy,
A fix for this has already been rolled into the 2.3 branch and is slated for release in 2.3.4. In
the meantime, you can use this snapshot, which includes the fix:
http://www.clusterresources.com/downloads/torque/snapshots/torque-2.3.4-snap.200809011357.tar.gz
Note that I fixed this only for Linux builds of TORQUE. Non-Linux builds may still have a similar
bug. When we get the time, we are going to go through the other "archs" and fix it there as well,
if needed.
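
For anyone following the thread, the kind of off-by-one described in the quoted message below can be sketched roughly as follows. This is only an illustration in Python, not the actual TORQUE source (which is C); the names Proc, cput_sum_buggy and cput_sum_fixed are hypothetical, and the session IDs for pids 3999 and 28227 are assumptions, since the log and ps output do not show them.

```python
from typing import NamedTuple


class Proc(NamedTuple):
    # Hypothetical stand-in for one entry of the MOM's process table.
    pid: int
    session: int
    cpu_seconds: int


def cput_sum_buggy(procs, session):
    """Sum CPU time for a session, but stop one entry short of the end."""
    total = 0
    for i in range(len(procs) - 1):  # off by one: the last entry is never visited
        if procs[i].session == session:
            total += procs[i].cpu_seconds
    return total


def cput_sum_fixed(procs, session):
    """Sum CPU time for a session over every entry in the table."""
    return sum(p.cpu_seconds for p in procs if p.session == session)


# 7-01:43:28 from the ps output, converted to seconds.
CPU_28227 = ((7 * 24 + 1) * 60 + 43) * 60 + 28

procs = [
    Proc(3999, 3999, 136),         # pbs_mom (session assumed)
    Proc(28083, 28083, 0),
    Proc(28226, 28083, 0),
    Proc(28227, 28083, CPU_28227), # last process on the machine (session assumed)
]

print(cput_sum_buggy(procs, 28083))
print(cput_sum_fixed(procs, 28083))
```

Under these assumptions the buggy loop reports 0 for the job while the fixed one reports the full CPU time. If any unrelated process is started later, 28227 is no longer the last entry and the buggy sum becomes correct again, which would be consistent with the problem sometimes "going away".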
--Josh B.
Caird, Andrew J wrote:
> Hi Josh,
>
> It is indeed the last pid on the machine.
>
> This also explains why the problem sometimes "goes away".
>
> Where can we get a fixed version or a patch for 2.3.3?
>
> Thanks a lot!
>
> --andy
>
>
>> -----Original Message-----
>> From: Josh Butikofer [mailto:josh at clusterresources.com]
>> Sent: Wednesday, September 03, 2008 2:14 PM
>> To: Caird, Andrew J
>> Cc: torqueusers at supercluster.org
>> Subject: Re: [torqueusers] torque 2.3.3 not reporting CPU time used
>>
>> Andrew,
>>
>> Out of curiosity, is the process 28227 the last process listed on that machine (i.e. the last
>> process to have been started in the system)? If so, I've found (and fixed) a bug where TORQUE
>> looks at all processes but the very last one when summing up used resources like CPU time.
>>
>> --Josh B.
>>
>> Caird, Andrew J wrote:
>>> Hello all,
>>>
>>> We've seen a few cases where the pbs_mom in Torque 2.3.3 doesn't report the CPU time.
>>> With the logging turned up to 8 for pbs_mom, I see:
>>>
>>> 09/03/2008 13:03:44;0080; pbs_mom;Svr;mom_get_sample;proc_array load started
>>> 09/03/2008 13:03:44;0080; pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=134
>>> 09/03/2008 13:03:44;0080; pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 1437922.nyx.engin.umich.edu
>>> 09/03/2008 13:03:44;0002; pbs_mom;n/a;cput_sum;cput_sum: session=28083 pid=28083 cputime=0 (cputfactor=1.000000)
>>> 09/03/2008 13:03:44;0002; pbs_mom;n/a;cput_sum;cput_sum: session=28083 pid=28226 cputime=0 (cputfactor=1.000000)
>>> 09/03/2008 13:03:44;0080; pbs_mom;n/a;mem_sum;proc_array loop start - jobid = 1437922.nyx.engin.umich.edu
>>> 09/03/2008 13:03:44;0080; pbs_mom;n/a;resi_sum;proc_array loop start - jobid = 1437922.nyx.engin.umich.edu
>>> 09/03/2008 13:03:44;0008; pbs_mom;Req;send_sisters;sending command POLL_JOB for job 1437922.nyx.engin.umich.edu (7)
>>>
>>> This is for a 4-task job on one node with no other tasks on this node - there are no other MOMs
>>> or jobs involved besides this one.
>>>
>>> [root@node378 ~]# ps -ef | egrep PPID\|pbs_mom\|28083\|28226
>>> UID        PID  PPID  C STIME TTY          TIME CMD
>>> root      3999     1  0 Aug20 ?        00:02:16 /usr/local/torque/sbin/pbs_mom -p
>>> user1    28083  3999  0 Aug27 ?        00:00:00 -sh
>>> user1    28226 28083  0 Aug27 ?        00:00:00 /bin/sh /var/spool/PBS/mom_priv/jobs/1437922.nyx.engin.umich.edu.SC
>>> user1    28227 28226 99 Aug27 ?      7-01:43:28 ./tortusorMFPA6.out
>>>
>>> The proc_array seems to be looking at 2 PIDs (28083 and 28226 in this case) but not looking at
>>> the third related PID (28227, the child of 28226), which is the process that has all of the
>>> CPU time.
>>>
>>> Has anyone else noticed this? Am I even reporting useful information?
>>> Thanks.
>>> --andy
>>>
>>>
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
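
The check made in the original report, comparing the PIDs that appear in the cput_sum log lines against the job's process tree from ps, can be sketched as below. This is a rough illustration rather than a TORQUE tool: the helper names sampled_pids and descendants are hypothetical, the log lines are rejoined from the wrapped output above, and walking the parent/child tree is used here only as a stand-in for session membership, which is what pbs_mom actually matches on.

```python
import re

# cput_sum lines from the pbs_mom log in the report, with the wrapped lines rejoined.
LOG = """\
09/03/2008 13:03:44;0002; pbs_mom;n/a;cput_sum;cput_sum: session=28083 pid=28083 cputime=0 (cputfactor=1.000000)
09/03/2008 13:03:44;0002; pbs_mom;n/a;cput_sum;cput_sum: session=28083 pid=28226 cputime=0 (cputfactor=1.000000)
"""

# (pid, ppid) pairs taken from the ps -ef output in the report.
PS = [(3999, 1), (28083, 3999), (28226, 28083), (28227, 28226)]


def sampled_pids(log_text):
    """Return the set of PIDs that cput_sum logged as having been examined."""
    pattern = r"cput_sum: session=\d+ pid=(\d+)"
    return {int(m.group(1)) for m in re.finditer(pattern, log_text)}


def descendants(root, ps_pairs):
    """Return root plus every process whose ancestry leads back to root."""
    children = {}
    for pid, ppid in ps_pairs:
        children.setdefault(ppid, []).append(pid)
    found, stack = set(), [root]
    while stack:
        pid = stack.pop()
        if pid not in found:
            found.add(pid)
            stack.extend(children.get(pid, []))
    return found


# Processes in the job's tree that never show up in the cput_sum log lines.
missing = descendants(28083, PS) - sampled_pids(LOG)
print(sorted(missing))
```

With the data from this thread, the only process in the job's tree that cput_sum never logs is 28227, the one holding all of the CPU time.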