[torqueusers] Problem with PBS killing jobs that exceed cput limit
fmalick at yahoo.com.ar
Wed Nov 7 09:47:48 MST 2012
Dear Sirs: I'm having a problem with torque/maui. Some long jobs are being killed by PBS because the reported cput exceeds the established limit.
However, the cput value reported is impossible.
This is what I get on the email report:
This is what appears in the log file of the job:
=>> PBS: job killed: cput 9223372036 exceeded limit 15552000
The limit is quite generous:
15552000 seconds --> 4320 hours --> 180 days
Number of CPUs (cores) used for this job: 12
walltime of the job when killed: 49*3600+48*60+37 = 179317 sec
Considering all 12 cores together, that would give at most 12 x 179317 = 2151804 sec of total cput, not the reported value of 9223372036, which is extremely large.
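The arithmetic above can be double-checked with a short Python sketch. All figures are taken from this message. The final comparison against the largest signed 64-bit integer is an added observation, not something from the logs: the reported cput happens to equal that maximum divided by 10^9, which may hint at an overflowed or sentinel counter being converted from nanoseconds to seconds. That is a guess, not a diagnosis.

```python
# Figures taken from the message above.
CPUT_LIMIT_S = 15_552_000        # configured cput limit, in seconds
REPORTED_CPUT_S = 9_223_372_036  # cput value reported when the job was killed
CORES = 12                       # cores used by the job

# Walltime at kill time: 49 h 48 min 37 s.
walltime_s = 49 * 3600 + 48 * 60 + 37

# Upper bound on real CPU time: every core busy for the whole walltime.
max_cput_s = CORES * walltime_s

# The limit expressed in hours and days.
limit_hours = CPUT_LIMIT_S // 3600
limit_days = limit_hours // 24

# Assumption for illustration: compare the reported value with the largest
# signed 64-bit integer interpreted as nanoseconds and truncated to seconds.
INT64_MAX = 2**63 - 1
int64_max_as_seconds = INT64_MAX // 10**9

print("walltime (s):", walltime_s)
print("max plausible cput (s):", max_cput_s)
print("limit:", limit_hours, "hours =", limit_days, "days")
print("reported / max plausible:", REPORTED_CPUT_S / max_cput_s)
print("reported == INT64_MAX // 1e9:", REPORTED_CPUT_S == int64_max_as_seconds)
```

If the last line prints True, the reported number is unlikely to be a real measurement, but the sketch says nothing about where in torque or pbs_mom such a value would come from.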
Obviously something is going wrong, probably with pbs_mom, but I don't know what.
We are using torque 2.5.2 (compiled from scratch), but on some of the nodes we have torque-mom (the Debian package with pbs-mom) version 2.4.8.
There is also a difference between the machines: the torque/maui server runs Linux kernel 2.6.18-6-686 #1 SMP i686 GNU/Linux,
while the nodes where we're having these problems run Linux kernel 2.6.32-5-amd64 #1 SMP x86_64 GNU/Linux.
Any idea what could be happening?