[torqueusers] torque-2.1.6 - pbs_mom cannot write to its log

Alessandro Federico alessandro.federico at caspur.it
Thu Oct 18 03:09:09 MDT 2007


Garrick Staples wrote:
> On Wed, Oct 17, 2007 at 11:49:01AM +0200, Alessandro Federico alleged:
>> Hi all.
>>
>> I'm running torque-2.1.6 on SLES10 x86_64 (2.6.16.27-0.9-smp).
>> Sometimes I observe this strange behavior:
>>
>> 1) before a node starts/joins the first job of the day
>> the file descriptor of the log file is correct
>>
>> --------------------------------------------
>> # lsof -p `pidof pbs_mom` | grep mom_logs
>> pbs_mom 7541 root    3w   REG     8,1  208319 126550 /opt/spool/torque/mom_logs/20071017
>> --------------------------------------------
>>
>> 2) after the node starts/joins the first job of the day
>> the file descriptor of the log file becomes corrupted
> 
> It's probably some other memory corruption going on.  Can you duplicate with 2.1.9?
> 

Of course you're right, but I can't upgrade to 2.1.9 at the moment
(the upgrade is planned for the end of November).

The strange thing is that we have been running torque-2.1.6 on our
cluster since January 2007, and it worked until the end of
September, when the problem began.

In July we upgraded the SLES10 kernel from 2.6.16.21-0.25-smp
to 2.6.16.27-0.9-smp.
Could the problem be related to this upgrade?
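
To catch the moment the descriptor changes, something like the following
could be run periodically on a node. It is only a sketch: it assumes Linux
/proc, and that the log stays on fd 3 as in the lsof output quoted above.
The function names fd_target and check_mom_log_fd are made up for this
example and are not part of torque.

```shell
#!/bin/sh
# Sketch only; function names are hypothetical, not part of torque.

# Print what file descriptor $2 of process $1 currently points to,
# by reading the symlink under the Linux /proc filesystem.
fd_target() {
    readlink "/proc/$1/fd/$2"
}

# Check that fd 3 of the given PID still points into a mom_logs directory.
# ASSUMPTION: the log file is on fd 3, as in the lsof output above.
# Returns 0 if it does, 1 if it points elsewhere, 2 if fd 3 is not open.
check_mom_log_fd() {
    pid=$1
    target=$(fd_target "$pid" 3) || { echo "fd 3 not open"; return 2; }
    case "$target" in
        */mom_logs/*) echo "ok: $target" ;;
        *) echo "unexpected: $target"; return 1 ;;
    esac
}
```

For example: check_mom_log_fd "$(pidof pbs_mom)". Comparing the result
before and after the first job of the day would show what fd 3 ends up
pointing to.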

Thanks
Ale


-- 
  Alessandro Federico
  CASPUR     http://www.caspur.it/
  e-mail:    alessandro.federico at caspur.it
  phone:     +39 06 44486708
  fax:       +39 06 4957083
------------------------------------------
  Military intelligence is a contradiction
  in terms.                 (Groucho Marx)
------------------------------------------

