[torqueusers] Problem with PBS killing jobs that exceed cput limit
Fernando Malick
fmalick at yahoo.com.ar
Wed Nov 7 09:47:48 MST 2012
Dear Sirs: I'm having a problem with torque/maui. Some long jobs are being killed by PBS because the reported cput exceeds the established limit.
However, the cput value reported is impossible.
This is what I get on the email report:
Exit_status=143
resources_used.cput=2562047:47:16
resources_used.mem=776112kb
resources_used.vmem=5000268kb
resources_used.walltime=49:48:37
This is what appears in the log file of the job:
=>> PBS: job killed: cput 9223372036 exceeded limit 15552000
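As a cross-check, the two figures above appear to be the same quantity in different units. A minimal Python sketch (the helper name hms_to_seconds is only illustrative, it is not part of torque) converts the HH:MM:SS value from the email report into seconds so it can be compared with the value in the job log:

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


reported_cput = hms_to_seconds("2562047:47:16")  # resources_used.cput from the email
logged_cput = 9223372036                         # cput from the "job killed" log line

print(reported_cput)                 # 9223372036
print(reported_cput == logged_cput)  # True
```

So the email report and the log agree with each other; the bogus value is the same in both places.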
2.4.8
The limit is quite generous:
15552000 seconds --> 4320 hours --> 180 days
Number of CPUs (cores) used for this job: 12
walltime of the job when killed: 49*3600+48*60+37 = 179317 sec
Even counting all 12 cores together, that would give at most 12 x 179317 = 2151804 sec of total cput, not the reported value of 9223372036, which is extremely large.
Obviously something is going wrong, probably with pbs_mom, but I don't know what.
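The arithmetic above can be checked with a short Python sketch. The variable names are only illustrative. The last lines add one observation: the reported value equals the largest signed 64-bit integer divided by 10^9. That equality is plain arithmetic; reading it as a hint of an overflow or a nanoseconds-to-seconds conversion somewhere in the cput accounting is only a guess, not something confirmed from the torque source.

```python
CPUT_LIMIT_S = 15552000      # configured cput limit, in seconds
CORES = 12                   # cores used by the job
REPORTED_CPUT_S = 9223372036 # cput from the "job killed" log line

# Limit expressed in hours and days.
limit_hours = CPUT_LIMIT_S // 3600
limit_days = limit_hours // 24

# Walltime 49:48:37 in seconds, and the most cput 12 cores could accumulate.
walltime_s = 49 * 3600 + 48 * 60 + 37
max_cput_s = CORES * walltime_s

print(limit_hours, limit_days)        # 4320 180
print(walltime_s, max_cput_s)         # 179317 2151804
print(max_cput_s < CPUT_LIMIT_S)      # True: the job should be well under the limit

# Observation (assumption about the cause, see note above):
INT64_MAX = 2**63 - 1
print(INT64_MAX // 10**9 == REPORTED_CPUT_S)  # True
```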
We are using torque 2.5.2 (compiled from scratch), but on some of the nodes we have torque-mom (debian package with pbs-mom) version 2.4.8
There is also a difference between the machines: the torque/maui server runs Linux kernel 2.6.18-6-686 #1 SMP i686 GNU/Linux,
while the nodes where we're having these problems run Linux kernel 2.6.32-5-amd64 #1 SMP x86_64 GNU/Linux.
Any idea what could be happening?
best regards
Fernando