[torqueusers] Exit_status=271

Lloyd Brown lloyd_brown at byu.edu
Fri Nov 4 09:34:59 MDT 2011


I know that the exit status gets offset by some number (128? 256?), but
it's not clear to me whether there is a correlation between the signal
number (SIGTERM, or signal 15) and the program's exit status.  If a
program that is killed by signal 15 reports an exit code of 15, and if
the offset is 256, that would explain the exit code of 271 that you see
(256+15).
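
To make the arithmetic concrete, here is a minimal Python sketch that
splits a reported exit status into an offset and a signal number.  The
256 offset is only the guess made above, not something confirmed from
the Torque source; decode_exit_status and its offset parameter are
hypothetical names used for illustration.

```python
import signal


def decode_exit_status(status, offset=256):
    """Interpret a reported exit status under an ASSUMED signal offset.

    Returns ("signal", number, name) when status >= offset,
    otherwise ("exit", status, None).
    """
    if status >= offset:
        signum = status - offset
        try:
            # Map the number to a symbolic name where the platform knows it.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown"
        return ("signal", signum, name)
    return ("exit", status, None)


if __name__ == "__main__":
    print(decode_exit_status(271))
```

Under that assumption, 271 decodes to signal 15 (SIGTERM).  Passing
offset=128 would instead model the convention shells use when they
report a child that was killed by a signal.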

From the snippet of logs, it looks like Maui decided somehow to delete
the job.  SIGTERM (15) is the first signal that Torque sends to the
job's process; if it fails to exit in a short period, it then sends
SIGKILL (9), which can't be caught/ignored.  We sometimes have users
catch TERM in their job script, and do some cleanup.

I'd look into why Maui decided to delete it, if I were you.  That's
likely the root of the problem.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 11/04/2011 09:18 AM, David Beer wrote:
> 
> 
> ----- Original Message -----
>> Hi all,
>>
>> We currently use torque 2.4.12 and maui 3.2.6p21 on a cluster.
>>
>> A job (20 nodes, ppn=4) that had been running for several hours was
>> deleted at the request of maui. Here is part of the torque log:
>>
>> 10/25/2011 11:08:46;0008;PBS_Server;Job;269580.;Job deleted at
>> request
>> of maui@
>> 10/25/2011 11:08:46;0008;PBS_Server;Job;269580.;Job sent signal
>> SIGTERM on delete
>> 10/25/2011 11:08:47;0009;PBS_Server;Job;269580.;job exit status 271
>> handled
>> 10/25/2011 11:08:48;0010;PBS_Server;Job;269580.;Exit_status=271
>> resources_used.cput=371:25:04 resources_used.mem=589796kb
>> resources_used.vmem=12901480kb resources_used.walltime=114:25:11
>>
>>
>> I'm wondering what could be the cause of this exit status 271.
>> The only explanation I found was "More RAM than asked for or over
>> allocated CPU time are the usual reasons".
>> That doesn't seem to be the reason here.
>>
>> Any idea?
>>
>> Regards,
>> Cédric.
> 
> This is because the job is being killed by signal 15; it's an oddity in Linux.
> 

