[torqueusers] torque 2.3.3 not reporting CPU time used
Caird, Andrew J
acaird at umich.edu
Wed Sep 3 14:43:11 MDT 2008
Hi Josh,
It is indeed the last pid on the machine.
This also explains why the problem sometimes "goes away" after we see it.
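A minimal sketch of the kind of off-by-one Josh describes, for illustration only. This is not TORQUE's code (pbs_mom is written in C). The `Proc` tuple, both function names, and pid 30000 are hypothetical, and I am assuming pid 28227 is in session 28083 like its parent:

```python
from typing import NamedTuple


class Proc(NamedTuple):
    pid: int
    session: int
    cputime: int  # CPU time used, in seconds


def cput_sum_buggy(procs, session):
    # Hypothetical off-by-one: the loop stops one entry short,
    # so the last process in the table is never examined.
    total = 0
    for i in range(len(procs) - 1):
        if procs[i].session == session:
            total += procs[i].cputime
    return total


def cput_sum_fixed(procs, session):
    # Examine every entry, including the last one.
    return sum(p.cputime for p in procs if p.session == session)


# TIME for pid 28227 in the ps output is 7-01:43:28
# (7 days, 1 hour, 43 minutes, 28 seconds).
busy = 7 * 86400 + 1 * 3600 + 43 * 60 + 28

table = [
    Proc(28083, 28083, 0),
    Proc(28226, 28083, 0),
    Proc(28227, 28083, busy),  # session is an assumption
]

print(cput_sum_buggy(table, 28083))  # misses pid 28227
print(cput_sum_fixed(table, 28083))

# Once any later process (hypothetical pid 30000) becomes the last
# entry, the buggy loop sees pid 28227 again.
later = table + [Proc(30000, 30000, 5)]
print(cput_sum_buggy(later, 28083))
```

In this sketch the job's CPU time is lost only while its busy process is the last entry in the table, which would match the intermittent behavior.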
Where can we get a fixed version or a patch for 2.3.3?
Thanks a lot!
--andy
> -----Original Message-----
> From: Josh Butikofer [mailto:josh at clusterresources.com]
> Sent: Wednesday, September 03, 2008 2:14 PM
> To: Caird, Andrew J
> Cc: torqueusers at supercluster.org
> Subject: Re: [torqueusers] torque 2.3.3 not reporting CPU time used
>
> Andrew,
>
> Out of curiosity, is process 28227 the last process listed on that
> machine (i.e. the last process to have been started on the system)?
> If so, I've found (and fixed) a bug where TORQUE looks at all
> processes but the very last one when summing up used resources like
> CPU time.
>
> --Josh B.
>
> Caird, Andrew J wrote:
> > Hello all,
> >
> > We've seen a few cases where the pbs_mom in Torque 2.3.3 doesn't report the CPU time.
> >
> > With the logging turned up to 8 for pbs_mom, I see:
> >
> > 09/03/2008 13:03:44;0080; pbs_mom;Svr;mom_get_sample;proc_array load started
> > 09/03/2008 13:03:44;0080; pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=134
> > 09/03/2008 13:03:44;0080; pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 1437922.nyx.engin.umich.edu
> > 09/03/2008 13:03:44;0002; pbs_mom;n/a;cput_sum;cput_sum: session=28083 pid=28083 cputime=0 (cputfactor=1.000000)
> > 09/03/2008 13:03:44;0002; pbs_mom;n/a;cput_sum;cput_sum: session=28083 pid=28226 cputime=0 (cputfactor=1.000000)
> > 09/03/2008 13:03:44;0080; pbs_mom;n/a;mem_sum;proc_array loop start - jobid = 1437922.nyx.engin.umich.edu
> > 09/03/2008 13:03:44;0080; pbs_mom;n/a;resi_sum;proc_array loop start - jobid = 1437922.nyx.engin.umich.edu
> > 09/03/2008 13:03:44;0008; pbs_mom;Req;send_sisters;sending command POLL_JOB for job 1437922.nyx.engin.umich.edu (7)
> >
> > This is for a 4-task job on one node with no other tasks on this node - there are no other MOMs or jobs involved besides this one.
> >
> >
> > [root@node378 ~]# ps -ef | egrep PPID\|pbs_mom\|28083\|28226
> > UID        PID  PPID  C STIME TTY          TIME CMD
> > root      3999     1  0 Aug20 ?        00:02:16 /usr/local/torque/sbin/pbs_mom -p
> > user1    28083  3999  0 Aug27 ?        00:00:00 -sh
> > user1    28226 28083  0 Aug27 ?        00:00:00 /bin/sh /var/spool/PBS/mom_priv/jobs/1437922.nyx.engin.umich.edu.SC
> > user1    28227 28226 99 Aug27 ?      7-01:43:28 ./tortusorMFPA6.out
> >
> > The proc_array seems to be looking at 2 PIDs (28083 and 28226 in this case) but not looking at the third related PID (28227, the child of 28226), which is the process that has all of the CPU time.
> >
> > Has anyone else noticed this? Am I even reporting useful information?
> >
> > Thanks.
> > --andy