[torqueusers] Torque/maui jobs terminated prematurely

Baker D.J. D.J.Baker at soton.ac.uk
Wed Sep 14 06:57:55 MDT 2005


Hi,
  We are running a computational cluster based on maui 3.2.6p11 and
torque 1.2.0p1. Just recently we have noticed that some of our jobs have
been terminated prematurely for no obvious reason. Some basic analysis
of the situation has indicated that the jobs terminate after using 30
minutes of cputime. Also it appears that the problem is completely
random -- in that it doesn't affect particular compute nodes or
particular jobs.

In our server configuration we have set values for both the default (2
hrs) and maximum (120 hrs) wallclock time. We have not set any
configuration values for job cputime usage -- therefore I assume that
the cputime usage for a job is implicitly infinite.

Has anyone in the torque/maui community ever seen this sort of behaviour,
and does anyone have a feel for what might be happening, please? It's
almost as if a Unix "limit" is coming into force, or perhaps the torque
system is getting confused and setting a cpu limit.
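
One way to test the Unix "limit" theory would be to print the CPU-time
limit that a job actually inherits on a compute node. The following is
only a minimal sketch, assuming Python is available on the nodes; the
function names are made up for illustration, and it reports only the
per-process limit (RLIMIT_CPU), not anything torque or maui tracks
internally. A soft limit of 1800 s would match the 30 minutes we see.

```python
import resource


def format_limit(value):
    """Render an rlimit value in seconds, or 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value} s ({value / 60:.1f} min)"


def cpu_limits():
    """Return the (soft, hard) CPU-time limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CPU)
    return format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    # Run this from inside a batch job so it sees the job's environment.
    soft, hard = cpu_limits()
    print(f"RLIMIT_CPU soft: {soft}")
    print(f"RLIMIT_CPU hard: {hard}")
```

Running this both from a login shell and from within a submitted job
should show whether the limit is being imposed by the batch environment
or is already present in the node's default shell settings.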

I suspect we might try to circumvent the issue by explicitly setting a
cputime limit via qmgr. In this respect, for a parallel job, is the
cputime specified in torque the sum of the cputimes used by each
process?

Any ideas or advice would be appreciated.

Thank you -- David Baker.


