[torqueusers] A strange problem with jobs getting killed during
garrick at clusterresources.com
Tue Nov 28 14:13:59 MST 2006
On Mon, Nov 27, 2006 at 10:29:54PM -0500, Prakash Velayutham alleged:
> I have a strange problem. One of my users is running a bunch of
> perl-based serial jobs in a cluster using Torque-2.1.6. His jobs
> typically run for more than a day.
> Earlier, we had noticed that his jobs generally get stopped at around
> 3 am. But it was not consistent. Today I realized that my
> home directory NFS server does its quotacheck at 3 am on Mon, Wed, Fri.
> And that is exactly the time when his jobs stop and get killed.
> To check this, I changed the quotacheck cron to run at 8 pm tonight. And
> I see that 6-7 of his jobs were killed right around that time.
> Here is an excerpt from the server_priv/accounting file.
> Some of them have an exit status of 30 and the rest 13. I have no idea
> what these mean. Any help?
The exit status doesn't mean anything to TORQUE; it is just passing
along whatever exit value was left from the user's login shell. This is
usually the same as the exit status of the batch script, unless a logout
script was executed afterward.
If, for example, the batch script was running under bash, you know that
127 means "command not found" and 128+n is termination from a signal
(where n is the signal number).
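You can see both conventions from any shell; a minimal sketch, assuming bash is available:

```shell
# 127: bash could not find the command.
bash -c 'no_such_command' 2>/dev/null
notfound=$?

# 143 = 128 + 15: bash was killed by SIGTERM (signal 15).
bash -c 'kill -TERM $$'
signaled=$?

echo "not found: $notfound, signaled: $signaled"
```

The `no_such_command` name is just a placeholder for any nonexistent program.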
It is probably safe to assume that the exit value is the errno from an
error generated from within the program:
$ perror 30
Error code 30: Read-only file system
$ perror 13
Error code 13: Permission denied
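(perror ships with MySQL; if you don't have it, the same errno strings can be looked up another way -- a sketch assuming python3 is installed:)

```shell
# Map errno values to their system error strings via strerror(3).
python3 -c 'import os; print(os.strerror(30))'   # EROFS
python3 -c 'import os; print(os.strerror(13))'   # EACCES
```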
Did your program print an error message to stderr? It would be in the
job's output file.