[torqueusers] 1.1.0p6 cpu time counter fails with very long jobs?

Garrick Staples garrick at usc.edu
Tue Feb 1 00:12:31 MST 2005


Do you have anything in your mom logs from around the time the jobs exited?

On Mon, Jan 31, 2005 at 03:22:28PM +0200, Mikko Huhtala alleged:
> 
> We have allowed very long jobs on our cluster. For serial jobs, the
> resources_max.cput attribute is set to 4320:00:00 (6 months).
> 
> We just had two single-processor jobs exit unexpectedly after running
> for 24 days. The pbs_server log shows the following:
> 
> /var/spool/pbs/server_logs/20050130:01/30/2005 16:10:31;0010;PBS_Server;Job;214.volvox.abo.fi;Exit_status=0 resources_used.cput=5329:48:49 resources_used.mem=26752kb resources_used.vmem=42968kb resources_used.walltime=596:25:37
> /var/spool/pbs/server_logs/20050130:01/30/2005 19:43:13;0010;PBS_Server;Job;215.volvox.abo.fi;Exit_status=0 resources_used.cput=5329:48:57 resources_used.mem=23056kb resources_used.vmem=38932kb resources_used.walltime=599:46:15
> 
> 
> The jobs did not run to completion (despite Exit_status=0), but were
> apparently terminated by Torque.
> 
> For some reason, the cpu time is almost 9 times greater than the wall
> time. These are single-processor jobs and the log shows that they were
> placed in a queue that takes serial jobs only. It looks like the cpu
> time counter got corrupted or something.
> 
> Other, shorter jobs in the pbs_server log seem ok; the cpu time is
> approximately the same as the wall time for single-processor jobs, and
> the wall time multiplied by the number of processors for parallel
> jobs.
> 
> Torque version is 1.1.0p6-snap.1105139538 (we needed p6 before it was
> released and haven't bothered to update to the final version...)
> running on Fedora Core 3 (Linux 2.6.9) and dual Pentium 4 Xeon
> machines.
> 
> Mikko
> 

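As a quick sanity check on the numbers quoted above, here is a minimal Python sketch that converts the HH:MM:SS fields from the two log lines to seconds and compares cput against walltime. The helper names (hms_to_seconds, cput_ratio) are made up for illustration and are not part of Torque; the values are copied from the pbs_server log entries in the quoted message.

```python
def hms_to_seconds(text):
    """Convert a PBS-style 'HHHH:MM:SS' duration string to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cput_ratio(cput, walltime):
    """Return resources_used.cput divided by resources_used.walltime."""
    return hms_to_seconds(cput) / hms_to_seconds(walltime)


# (cput, walltime) pairs copied from the quoted server log lines.
jobs = {
    "214.volvox.abo.fi": ("5329:48:49", "596:25:37"),
    "215.volvox.abo.fi": ("5329:48:57", "599:46:15"),
}

for job_id, (cput, walltime) in jobs.items():
    days = hms_to_seconds(walltime) / 86400
    ratio = cput_ratio(cput, walltime)
    print(f"{job_id}: walltime {days:.1f} days, cput/walltime = {ratio:.2f}")

# The configured limit, for comparison: 4320 hours expressed in days.
print("resources_max.cput =", hms_to_seconds("4320:00:00") / 86400, "days")
```

For a single-processor job the ratio should stay near 1. Both jobs come out at roughly 8.9, and both reported cput values are above the 4320:00:00 limit, which would fit with the jobs being killed for exceeding cput even though the counter itself looks wrong.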
-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California