[torqueusers] death of pbs_mom's
Brock Palen
brockp at umich.edu
Wed Oct 11 12:08:17 MDT 2006
On Oct 11, 2006, at 12:29 PM, Jerry Smith wrote:
> Brock,
>
>
>>
>> We have been using torque-2.1.2 for a while on an amd64 RHEL4 cluster.
>> In the last few days we started seeing moms dying on the nodes.
>> They are not dying en masse, and there has been no noticeable
>> correlation between the nodes where they have died.
>>
>> The logs say nothing. This causes jobs to hang in the queue in the
>> Running state, and they can only be removed by purging the job. These
>> hung jobs then cause Moab to keep trying, without ever stopping, to
>> remove the now very old job, so it never sleeps.
>>
>> Has anyone else ever seen a problem when a mom dies?
>>
>
>
> We had a similar problem a while back, and actually found a
> segfault (this was in the 2.0 series).
>
> Do you have:
> PBSCOREDUMP=1
> set in your $PBS_HOME/environment?
I have not, but will look into doing that.
>
> This causes the mom to dump a core that you can look into, which
> gave us the information to pass along to the torque-devs and
> resulted in a fix.
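For reference, the setting Jerry describes could be applied with something like the sketch below. It is only an illustration: the function name enable_mom_coredump is made up for this example, and the path to the environment file has to be passed in explicitly. It appends PBSCOREDUMP=1 to the given file unless a PBSCOREDUMP line is already present.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE): add PBSCOREDUMP=1 to a
# pbs_mom environment file unless a PBSCOREDUMP line already exists.
enable_mom_coredump() {
    env_file=$1
    if [ -z "$env_file" ]; then
        echo "usage: enable_mom_coredump ENV_FILE" >&2
        return 2
    fi
    # Leave the file alone if the variable is already set.
    if [ -f "$env_file" ] && grep -q '^PBSCOREDUMP=' "$env_file"; then
        return 0
    fi
    # If the file is non-empty and lacks a trailing newline, add one
    # so the new setting starts on its own line.
    if [ -s "$env_file" ] && [ -n "$(tail -c 1 "$env_file")" ]; then
        printf '\n' >> "$env_file"
    fi
    printf 'PBSCOREDUMP=1\n' >> "$env_file"
}
```

On a node this would be called as enable_mom_coredump "$PBS_HOME/environment", assuming PBS_HOME is set in the shell. The mom most likely has to be restarted before the change takes effect, and the core file size limit on the node may also need raising before a core is actually written.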
>
>
> How are you "purging" the job?
With qdel -p JOBID. This deletes a job that reports a cancel is already
in progress.
>
>
>
> Jerry
>
>
>
>