[torqueusers] torque 2.3.3 not reporting CPU time used

Josh Butikofer josh at clusterresources.com
Wed Sep 3 12:13:58 MDT 2008


Andrew,

Out of curiosity, is process 28227 the last process listed on that machine (i.e. the most
recently started process on the system)? If so, I've found (and fixed) a bug where TORQUE looks
at every process except the very last one when summing up used resources such as CPU time.

--Josh B.
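
To make the kind of bug described above concrete, here is a minimal sketch of an off-by-one loop that skips the last entry of a process table. It is an illustration only, not the TORQUE source: the Proc tuple, the function names and the sample numbers are all hypothetical.

```python
from collections import namedtuple

# Hypothetical stand-in for one entry of the process table.
Proc = namedtuple("Proc", "pid session cputime")

def cput_sum_buggy(procs, session):
    """Sum CPU time for a session, but stop one entry short (the bug)."""
    total = 0
    for i in range(len(procs) - 1):  # off-by-one: last entry is never visited
        if procs[i].session == session:
            total += procs[i].cputime
    return total

def cput_sum_fixed(procs, session):
    """Sum CPU time for a session over every entry in the table."""
    return sum(p.cputime for p in procs if p.session == session)

# Made-up table: two idle shells, then the worker that used all the CPU
# and happens to be the last entry.
procs = [Proc(10, 10, 0), Proc(11, 10, 0), Proc(12, 10, 500)]

print(cput_sum_buggy(procs, 10))
print(cput_sum_fixed(procs, 10))
```

If the only process that has accumulated CPU time is also the last entry in the table, the buggy loop reports zero for the whole job, which would match the symptom quoted below.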

Caird, Andrew J wrote:
> Hello all,
> 
> We've seen a few cases where the pbs_mom in Torque 2.3.3 doesn't report the CPU time.
> 
> With the logging turned up to 8 for pbs_mom, I see:
> 
> 09/03/2008 13:03:44;0080;   pbs_mom;Svr;mom_get_sample;proc_array load started
> 09/03/2008 13:03:44;0080;   pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=134
> 09/03/2008 13:03:44;0080;   pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 1437922.nyx.engin.umich.edu
> 09/03/2008 13:03:44;0002;   pbs_mom;n/a;cput_sum;cput_sum: session=28083 pid=28083 cputime=0 (cputfactor=1.000000)
> 09/03/2008 13:03:44;0002;   pbs_mom;n/a;cput_sum;cput_sum: session=28083 pid=28226 cputime=0 (cputfactor=1.000000)
> 09/03/2008 13:03:44;0080;   pbs_mom;n/a;mem_sum;proc_array loop start - jobid = 1437922.nyx.engin.umich.edu
> 09/03/2008 13:03:44;0080;   pbs_mom;n/a;resi_sum;proc_array loop start - jobid = 1437922.nyx.engin.umich.edu
> 09/03/2008 13:03:44;0008;   pbs_mom;Req;send_sisters;sending command POLL_JOB for job 1437922.nyx.engin.umich.edu (7)
> 
> This is for a 4-task job on one node with no other tasks on this node - there are no other MOMs or jobs involved besides this one.
> 
> 
> [root@node378 ~]# ps -ef | egrep PPID\|pbs_mom\|28083\|28226
> UID       PID  PPID  C STIME TTY          TIME CMD
> root     3999     1  0 Aug20 ?        00:02:16 /usr/local/torque/sbin/pbs_mom -p
> user1   28083  3999  0 Aug27 ?        00:00:00 -sh
> user1   28226 28083  0 Aug27 ?        00:00:00 /bin/sh /var/spool/PBS/mom_priv/jobs/1437922.nyx.engin.umich.edu.SC
> user1   28227 28226 99 Aug27 ?        7-01:43:28 ./tortusorMFPA6.out
> 
> The proc_array seems to be looking at 2 PIDs (28083 and 28226 in this case) but not looking at the third related PID (28227, the child of 28226), which is the process that has all of the CPU time.
> 
> Has anyone else noticed this?  Am I even reporting useful information?
> 
> Thanks.
> --andy
> 
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
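
One way to cross-check the per-session numbers independently of pbs_mom is to read /proc directly. The following is a rough sketch, assuming a Linux /proc filesystem with the usual stat layout (session ID in field 6, utime and stime in fields 14 and 15, both in clock ticks). The helper names are hypothetical and this is not how pbs_mom itself is implemented.

```python
import os

def parse_stat_line(line):
    """Return (pid, ppid, session, cpu_ticks) from one /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so split around the last ')' instead of on whitespace alone.
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen])
    rest = line[rparen + 1:].split()  # rest[0] is field 3 (state)
    ppid = int(rest[1])               # field 4
    session = int(rest[3])            # field 6
    cpu_ticks = int(rest[11]) + int(rest[12])  # fields 14 (utime) + 15 (stime)
    return pid, ppid, session, cpu_ticks

def session_cpu_seconds(session_id, proc_root="/proc"):
    """Add up user+system CPU seconds for every process in one session."""
    ticks_per_sec = os.sysconf("SC_CLK_TCK")
    total_ticks = 0
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, name, "stat")) as f:
                _, _, session, ticks = parse_stat_line(f.read())
        except OSError:
            continue  # process exited while we were scanning
        if session == session_id:
            total_ticks += ticks
    return total_ticks / ticks_per_sec
```

Calling session_cpu_seconds(28083) on the node in question should include the CPU time of any child that shares the job's session, which gives a figure to compare against what the cput_sum log lines report.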

