[torqueusers] death of pbs_mom's

Brock Palen brockp at umich.edu
Wed Oct 11 12:08:17 MDT 2006


On Oct 11, 2006, at 12:29 PM, Jerry Smith wrote:

> Brock,
>
>
>>
>> We have been using torque-2.1.2 for a while on an amd64 RHEL4 cluster.
>> In the last few days we started seeing moms dying on the nodes.
>> While they are not dying en masse, there has been no noticeable
>> correlation between the nodes that have had them die.
>>
>> The logs say nothing. This causes jobs to hang in the queue in the
>> Running state, and they can only be removed by purging the job. These
>> hung jobs go on to make Moab angry: it never stops trying to
>> remove the now very old job, and so it never sleeps.
>>
>> Has anyone else ever seen a problem when a mom dies?
>>
>
>
> We had a similar problem a while back, and actually found a segfault
> (this was in the 2.0 series).
>
> Do you have:
> PBSCOREDUMP=1
> set in your $PBS_HOME/environment?
I have not, but I will look into doing that.
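
For anyone following along, here is a rough sketch of how that setting could be added without duplicating it. The function name enable_mom_coredump is made up for illustration, the file path is the one Jerry names above, and I am assuming the mom has to be restarted before the new environment is picked up.

```shell
# Sketch only: append PBSCOREDUMP=1 to a mom environment file
# unless a PBSCOREDUMP line is already present.
# enable_mom_coredump is a hypothetical helper name, not part of TORQUE.
enable_mom_coredump() {
    env_file="$1"
    if grep -q '^PBSCOREDUMP=' "$env_file" 2>/dev/null; then
        # Leave an existing setting untouched.
        echo "already set"
    else
        echo 'PBSCOREDUMP=1' >> "$env_file"
        echo "added"
    fi
}
```

It would be called as: enable_mom_coredump "$PBS_HOME/environment" on each node, then restart the mom. Core file size limits on the node may also need checking; that part is not covered here.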

>
> This causes the mom to dump a core that you can look into, which gave
> us the information to pass along to the torque-devs and resulted in a fix.
>
>
> How are you "purging" the job?
With qdel -p JOBID. This deletes a job that reports a cancel is already
in progress.
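
When several jobs are stuck, the same command can be wrapped in a small loop. This is only a sketch: purge_jobs is a hypothetical helper name, and the QDEL variable is an assumption added so the command can be swapped out for a dry run. The only qdel option used is the -p mentioned above.

```shell
# Sketch only: purge each job ID given on the command line with qdel -p.
# purge_jobs and the QDEL override are illustrative, not part of TORQUE.
purge_jobs() {
    for jobid in "$@"; do
        # Report a failed purge but keep going with the remaining jobs.
        "${QDEL:-qdel}" -p "$jobid" || echo "purge failed for $jobid" >&2
    done
}
```

Usage would look like: purge_jobs JOBID1 JOBID2, with the job IDs taken from whatever qstat shows as hung.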

>
>
>
> Jerry
>
>
>
>


