[torqueusers] torque 2.3.3 not reporting CPU time used

Josh Butikofer josh at clusterresources.com
Wed Sep 3 14:55:40 MDT 2008


Andy,

A fix for this has already been rolled into the 2.3 branch and is slated for release in 2.3.4. In
the meantime, you can use this snapshot, which includes the fix:
http://www.clusterresources.com/downloads/torque/snapshots/torque-2.3.4-snap.200809011357.tar.gz

Also, I fixed this only for Linux builds of TORQUE. Non-Linux builds may still have a similar bug.
When we get the time, we are going to go through the other "archs" and fix it wherever it is
present.

--Josh B.

Caird, Andrew J wrote:
> Hi Josh,
> 
> It is indeed the last pid on the machine.
> 
> This also explains why the problem sometimes "goes away" after we have seen it.
> 
> Where can we get a fixed version or a patch for 2.3.3?
> 
> Thanks a lot!
> 
> --andy
> 
> 
>> -----Original Message-----
>> From: Josh Butikofer [mailto:josh at clusterresources.com]
>> Sent: Wednesday, September 03, 2008 2:14 PM
>> To: Caird, Andrew J
>> Cc: torqueusers at supercluster.org
>> Subject: Re: [torqueusers] torque 2.3.3 not reporting CPU time used
>>
>> Andrew,
>>
>> Out of curiosity, is the process 28227 the last process listed on that machine (i.e. the last
>> process to have been started in the system)? If so, I've found (and fixed) a bug where TORQUE
>> looks at all processes but the very last one when summing up used resources like CPU time.
>>
>> --Josh B.
>>
>> Caird, Andrew J wrote:
>>> Hello all,
>>>
>>> We've seen a few cases where the pbs_mom in Torque 2.3.3 doesn't report the CPU time.
>>> With the logging turned up to 8 for pbs_mom, I see:
>>>
>>> 09/03/2008 13:03:44;0080;   pbs_mom;Svr;mom_get_sample;proc_array load started
>>> 09/03/2008 13:03:44;0080;   pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=134
>>> 09/03/2008 13:03:44;0080;   pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 1437922.nyx.engin.umich.edu
>>> 09/03/2008 13:03:44;0002;   pbs_mom;n/a;cput_sum;cput_sum: session=28083 pid=28083 cputime=0 (cputfactor=1.000000)
>>> 09/03/2008 13:03:44;0002;   pbs_mom;n/a;cput_sum;cput_sum: session=28083 pid=28226 cputime=0 (cputfactor=1.000000)
>>> 09/03/2008 13:03:44;0080;   pbs_mom;n/a;mem_sum;proc_array loop start - jobid = 1437922.nyx.engin.umich.edu
>>> 09/03/2008 13:03:44;0080;   pbs_mom;n/a;resi_sum;proc_array loop start - jobid = 1437922.nyx.engin.umich.edu
>>> 09/03/2008 13:03:44;0008;   pbs_mom;Req;send_sisters;sending command POLL_JOB for job 1437922.nyx.engin.umich.edu (7)
>>>
>>> This is for a 4-task job on one node with no other tasks on this node - there are no other
>>> MOMs or jobs involved besides this one.
>>>
>>> [root at node378 ~]# ps -ef | egrep PPID\|pbs_mom\|28083\|28226
>>> UID       PID  PPID  C STIME TTY          TIME CMD
>>> root     3999     1  0 Aug20 ?        00:02:16 /usr/local/torque/sbin/pbs_mom -p
>>> user1   28083  3999  0 Aug27 ?        00:00:00 -sh
>>> user1   28226 28083  0 Aug27 ?        00:00:00 /bin/sh /var/spool/PBS/mom_priv/jobs/1437922.nyx.engin.umich.edu.SC
>>> user1   28227 28226 99 Aug27 ?        7-01:43:28 ./tortusorMFPA6.out
>>>
>>> The proc_array seems to be looking at 2 PIDs (28083 and 28226 in this case) but not looking
>>> at the third related PID (28227, the child of 28226), which is the process that has all of
>>> the CPU time.
>>>
>>> Has anyone else noticed this?  Am I even reporting useful information?
>>>
>>> Thanks.
>>> --andy
>>>
>>>
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
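
The bug Josh describes (a sum that visits every process except the last one in the table) can be
illustrated with a small Python sketch. This is not TORQUE's source code: the function names, the
tuple layout, and the session ids assigned to pids 3999 and 28227 are assumptions made for
illustration. The pids and CPU times are taken from the ps output quoted in the thread.

```python
# Illustrative only: a toy process table, not TORQUE's real proc_array.
# Each entry is (pid, session_id, cpu_seconds), ordered by start time.
# Session ids for pids 3999 and 28227 are assumed, not taken from the logs.
PROC_TABLE = [
    (3999, 3999, 136),        # pbs_mom, TIME 00:02:16
    (28083, 28083, 0),        # -sh
    (28226, 28083, 0),        # job script
    (28227, 28083, 611008),   # ./tortusorMFPA6.out, TIME 7-01:43:28
]


def cput_sum_buggy(procs, session):
    """Sum CPU seconds for a session, but stop one entry early (off-by-one)."""
    total = 0
    for i in range(len(procs) - 1):  # the last process is never examined
        pid, sid, cpu = procs[i]
        if sid == session:
            total += cpu
    return total


def cput_sum_fixed(procs, session):
    """Sum CPU seconds for a session over every entry in the table."""
    return sum(cpu for pid, sid, cpu in procs if sid == session)


if __name__ == "__main__":
    print("buggy:", cput_sum_buggy(PROC_TABLE, 28083))
    print("fixed:", cput_sum_fixed(PROC_TABLE, 28083))
    # A newer, unrelated process (hypothetical pid 30000) pushes 28227 off the end.
    later = PROC_TABLE + [(30000, 30000, 1)]
    print("buggy, after a newer process starts:", cput_sum_buggy(later, 28083))
```

With the table as posted, the truncated loop reports 0 seconds for session 28083, while the full
loop reports 611008 seconds (7-01:43:28). Once any newer process exists, pid 28227 is no longer
the last entry and the truncated loop counts it again, which is consistent with Andy's remark
that the problem sometimes "goes away".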

