lloyd_brown at byu.edu
Fri Nov 4 09:34:59 MDT 2011
I know that exit status gets offset by some number (128? 256?), but it's
not clear to me whether there is a correlation between the signal number
(SIGTERM, or signal 15) and the program's exit status. If a program
killed by signal 15 reports an exit code of 15, and the offset is 256,
that would explain the exit status of 271 that you see (256+15).
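To make the arithmetic concrete, here is a rough sketch of how one might decode such a status. The offset of 256 is only the guess from above (many shells use 128 + signal instead), and the function name decode_exit_status is made up for illustration:

```python
import signal

# Assumption: the offset is 256, as guessed above. Many shells use 128 instead.
ASSUMED_SIGNAL_OFFSET = 256


def decode_exit_status(status, offset=ASSUMED_SIGNAL_OFFSET):
    """Return ("signal", name) if status is above the offset, else ("exit", code)."""
    if status > offset:
        signum = status - offset
        try:
            # Map the number to its symbolic name on this platform.
            return ("signal", signal.Signals(signum).name)
        except ValueError:
            # Not a signal number known on this platform; report it raw.
            return ("signal", str(signum))
    return ("exit", status)


print(decode_exit_status(271))  # on Linux: ('signal', 'SIGTERM')
```

If the offset turns out to be 128 rather than 256, passing offset=128 decodes a status of 143 the same way.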
From the snippet of logs, it looks like Maui decided somehow to delete
the job. SIGTERM (15) is the first signal that Torque sends to the
job's process; if it fails to exit in a short period, it then sends
SIGKILL (9), which can't be caught/ignored. We sometimes have users
catch TERM in their job script, and do some cleanup.
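As an illustration of catching TERM and cleaning up, here is a minimal sketch for a job whose payload is a Python program. The names state, on_term, cleanup and run are hypothetical, and the cleanup body is a placeholder; the handler only records the request so that the main loop can stop at a safe point, and the cleanup should be quick since SIGKILL may follow shortly:

```python
import signal

# Hypothetical bookkeeping so the effect of the handler is observable.
state = {"term_received": False, "cleaned_up": False}


def on_term(signum, frame):
    # Only record the request; the main loop decides when to stop.
    state["term_received"] = True


def cleanup():
    # Placeholder: copy results off scratch space, remove temp files, etc.
    state["cleaned_up"] = True


def run(work_steps):
    """Run the steps in order, stopping early if SIGTERM arrives, then clean up."""
    signal.signal(signal.SIGTERM, on_term)  # must be called from the main thread
    completed = 0
    for step in work_steps:
        if state["term_received"]:
            break  # stop before starting another step
        step()
        completed += 1
    cleanup()
    return completed
```

A job script written in shell would do the equivalent with its own trap mechanism; the point is only that the process gets a short window between TERM and KILL to tidy up.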
I'd look into why Maui decided to delete it, if I were you. That's
likely the root of the problem.
Fulton Supercomputing Lab
Brigham Young University
On 11/04/2011 09:18 AM, David Beer wrote:
> ----- Original Message -----
>> Hi all,
>> We currently use torque 2.4.12 and maui 3.2.6p21 on a cluster.
>> A job that was running for several hours has been deleted at the
>> request of maui (20 nodes and ppn=4). Here is part of Torque's
>> log:
>> 10/25/2011 11:08:46;0008;PBS_Server;Job;269580.;Job deleted at request
>> of maui@
>> 10/25/2011 11:08:46;0008;PBS_Server;Job;269580.;Job sent signal
>> SIGTERM on delete
>> 10/25/2011 11:08:47;0009;PBS_Server;Job;269580.;job exit status 271
>> 10/25/2011 11:08:48;0010;PBS_Server;Job;269580.;Exit_status=271
>> resources_used.cput=371:25:04 resources_used.mem=589796kb
>> resources_used.vmem=12901480kb resources_used.walltime=114:25:11
>> I'm wondering what could be the cause of this exit status 271.
>> The only causes I found were "More RAM than asked for or over
>> allocated CPU time are the usual reasons".
>> This doesn't seem to be the reason here.
>> Any idea?
> This is because the job is being killed by signal 15; it's an oddity in Linux.