[torqueusers] torque 2.3.3 not reporting CPU time used
Josh Butikofer
josh at clusterresources.com
Wed Sep 3 12:13:58 MDT 2008
Andrew,
Out of curiosity, is process 28227 the last process listed on that machine (i.e., the most
recently started process on the system)? If so, I've found (and fixed) a bug where TORQUE examines
every process except the very last one when summing used resources such as CPU time.
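
To illustrate the kind of off-by-one described above, here is a minimal sketch. It is not TORQUE source code: the function names, the tuple layout, and the idea that the loop bound is simply one short are assumptions made for illustration. It also assumes that pid 28227 belongs to session 28083, which the ps output in the report does not show directly. The CPU time is the 7-01:43:28 from that output, converted to seconds.

```python
# Illustrative sketch only; not taken from the TORQUE sources.
# Each entry is (session_id, pid, cpu_seconds), loosely modelled on the
# three job processes in the report. Assumption: pid 28227 shares
# session 28083 and is the last entry in the process table.
procs = [
    (28083, 28083, 0),
    (28083, 28226, 0),
    (28083, 28227, 7 * 86400 + 1 * 3600 + 43 * 60 + 28),  # 7-01:43:28
]


def cput_sum_buggy(proc_array, session):
    """Hypothetical buggy sum: the loop stops one entry early."""
    total = 0
    for i in range(len(proc_array) - 1):  # off-by-one: last entry is skipped
        sid, _pid, cpu = proc_array[i]
        if sid == session:
            total += cpu
    return total


def cput_sum_fixed(proc_array, session):
    """Hypothetical corrected sum: every entry is visited."""
    total = 0
    for sid, _pid, cpu in proc_array:
        if sid == session:
            total += cpu
    return total


print("buggy:", cput_sum_buggy(procs, 28083))
print("fixed:", cput_sum_fixed(procs, 28083))
```

Under these assumptions the buggy version never sees the one process that has accumulated CPU time, which would match a reported cputime of 0 for the job.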
--Josh B.
Caird, Andrew J wrote:
> Hello all,
>
> We've seen a few cases where the pbs_mom in Torque 2.3.3 doesn't report the CPU time.
>
> With the logging turned up to 8 for pbs_mom, I see:
>
> 09/03/2008 13:03:44;0080; pbs_mom;Svr;mom_get_sample;proc_array load started
> 09/03/2008 13:03:44;0080; pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=134
> 09/03/2008 13:03:44;0080; pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 1437922.nyx.engin.umich.edu
> 09/03/2008 13:03:44;0002; pbs_mom;n/a;cput_sum;cput_sum: session=28083 pid=28083 cputime=0 (cputfactor=1.000000)
> 09/03/2008 13:03:44;0002; pbs_mom;n/a;cput_sum;cput_sum: session=28083 pid=28226 cputime=0 (cputfactor=1.000000)
> 09/03/2008 13:03:44;0080; pbs_mom;n/a;mem_sum;proc_array loop start - jobid = 1437922.nyx.engin.umich.edu
> 09/03/2008 13:03:44;0080; pbs_mom;n/a;resi_sum;proc_array loop start - jobid = 1437922.nyx.engin.umich.edu
> 09/03/2008 13:03:44;0008; pbs_mom;Req;send_sisters;sending command POLL_JOB for job 1437922.nyx.engin.umich.edu (7)
>
> This is for a 4-task job on one node with no other tasks on this node - there are no other MOMs or jobs involved besides this one.
>
>
> [root@node378 ~]# ps -ef | egrep PPID\|pbs_mom\|28083\|28226
> UID PID PPID C STIME TTY TIME CMD
> root 3999 1 0 Aug20 ? 00:02:16 /usr/local/torque/sbin/pbs_mom -p
> user1 28083 3999 0 Aug27 ? 00:00:00 -sh
> user1 28226 28083 0 Aug27 ? 00:00:00 /bin/sh /var/spool/PBS/mom_priv/jobs/1437922.nyx.engin.umich.edu.SC
> user1 28227 28226 99 Aug27 ? 7-01:43:28 ./tortusorMFPA6.out
>
> The proc_array seems to be looking at 2 PIDs (28083 and 28226 in this case) but not looking at the third related PID (28227, the child of 28226), which is the process that has all of the CPU time.
>
> Has anyone else noticed this? Am I even reporting useful information?
>
> Thanks.
> --andy