[torqueusers] Problem with PBS killing jobs that exceed cput limit

Fernando Malick fmalick at yahoo.com.ar
Wed Nov 7 09:47:48 MST 2012


Dear Sirs: I'm having a problem with torque/maui. Some long jobs are being killed by PBS because the reported cput exceeds the established limit.
However, the reported cput value is impossible.

This is what I get on the email report:

Exit_status=143
resources_used.cput=2562047:47:16
resources_used.mem=776112kb
resources_used.vmem=5000268kb
resources_used.walltime=49:48:37


This is what appears in the log file of the job:
=>> PBS: job killed: cput 9223372036 exceeded limit 15552000
The limit is quite generous:
15552000 seconds --> 4320 hours --> 180 days
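
The two numbers above can be cross-checked with a few lines of Python. This is only a sketch of the arithmetic: the helper name hms_to_seconds is illustrative and not part of torque. It suggests that the cput in the email report and the cput in the log message are the same value, just printed in different units.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'HHH:MM:SS' string, as shown in the PBS report, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Values copied from the email report and the job log above.
reported_cput = hms_to_seconds("2562047:47:16")
limit = 15552000

print(reported_cput)          # 9223372036, same as in the log message
print(limit // 3600)          # 4320 hours
print(limit // 86400)         # 180 days
print(reported_cput > limit)  # True, which is why the job was killed
```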

Number of CPUs (cores) used for this job: 12

walltime of the job when killed: 49*3600+48*60+37 = 179317 sec

Considering all 12 cores together, that would give at most 12 x 179317 = 2151804 sec of total cput, not the reported value of 9223372036, which is extremely large.
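
A minimal Python sketch of this sanity check, with illustrative variable names, is shown below. The last line is only a numeric observation: the reported value equals the largest signed 64-bit integer divided by 10^9. That may hint at an overflow or a sentinel value somewhere in the accounting, but it is an assumption and not a confirmed diagnosis.

```python
cores = 12
walltime = 49 * 3600 + 48 * 60 + 37  # 49:48:37 in seconds
max_cput = cores * walltime          # upper bound if all cores were 100% busy
reported_cput = 9223372036           # value from the job log

print(walltime)                  # 179317
print(max_cput)                  # 2151804
print(reported_cput > max_cput)  # True: the reported value cannot be real

# Observation only (not a diagnosis): the value matches INT64_MAX / 10^9.
print(reported_cput == (2**63 - 1) // 10**9)  # True
```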

Obviously something is going wrong, probably with pbs_mom, but I don't know what.

We are using torque 2.5.2 (compiled from scratch), but on some of the nodes we have torque-mom (the Debian package containing pbs_mom) version 2.4.8.
There is also another difference between the machines: the torque/maui server runs Linux kernel 2.6.18-6-686 #1 SMP i686 GNU/Linux,
while the nodes where we're having these problems run Linux kernel 2.6.32-5-amd64 #1 SMP x86_64 GNU/Linux.


Any idea what can be happening?

best regards


Fernando