[torqueusers] torque-2.1.6 - pbs_mom cannot write to its log

Alessandro Federico alessandro.federico at caspur.it
Wed Oct 17 03:49:01 MDT 2007


Hi all.

I'm running torque-2.1.6 on SLES10 x86_64 (2.6.16.27-0.9-smp).
Sometimes I observe this strange behavior:

1) Before a node starts or joins its first job of the day,
file descriptor 3 of pbs_mom points to the log file, as expected:

--------------------------------------------
# lsof -p `pidof pbs_mom` | grep mom_logs
pbs_mom 7541 root    3w   REG     8,1  208319 126550 /opt/spool/torque/mom_logs/20071017
--------------------------------------------

2) After the node starts or joins its first job of the day,
file descriptor 3 no longer points to the log file; it refers to a socket:

--------------------------------------------
# lsof -p `pidof pbs_mom`
[...]
pbs_mom 7541 root    3w  sock                0,5         1244251 can't identify protocol
[...]

# ll /proc/`pidof pbs_mom`/fd
total 11
lr-x------ 1 root root 64 2007-10-15 14:39 0 -> /dev/null
l-wx------ 1 root root 64 2007-10-15 14:39 1 -> /dev/null
lrwx------ 1 root root 64 2007-10-16 11:06 11 -> socket:[1244252]
l-wx------ 1 root root 64 2007-10-15 14:39 2 -> /dev/null
l-wx------ 1 root root 64 2007-10-15 14:39 3 -> socket:[1244251]
l-wx------ 1 root root 64 2007-10-15 14:39 4 -> /opt/spool/torque/mom_priv/mom.lock
lrwx------ 1 root root 64 2007-10-15 14:39 5 -> socket:[679046]
lrwx------ 1 root root 64 2007-10-15 14:39 6 -> socket:[679047]
lrwx------ 1 root root 64 2007-10-15 14:39 7 -> socket:[679050]
lrwx------ 1 root root 64 2007-10-15 14:39 8 -> socket:[679051]
lr-x------ 1 root root 64 2007-10-15 14:39 9 -> /proc
--------------------------------------------
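
The check in step 2 can be scripted so that a node in this state is spotted
before the system log fills up. What follows is only a minimal sketch, assuming
a Linux /proc filesystem and the log directory shown above; the function names
(read_fd_links, classify, find_log_fd) are made up for illustration and are not
part of torque.

```python
import os
from typing import Dict, Optional


def read_fd_links(pid: int) -> Dict[int, str]:
    """Return {fd_number: link_target} read from /proc/<pid>/fd (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    links: Dict[int, str] = {}
    for name in os.listdir(fd_dir):
        try:
            links[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return links


def classify(target: str) -> str:
    """Roughly classify a /proc/<pid>/fd link target."""
    if target.startswith("socket:["):
        return "socket"
    if target.startswith("pipe:["):
        return "pipe"
    if target.startswith("/"):
        return "path"
    return "other"


def find_log_fd(links: Dict[int, str], log_dir: str) -> Optional[int]:
    """Return the lowest fd whose target lies under log_dir, or None."""
    prefix = log_dir.rstrip("/") + "/"
    for fd, target in sorted(links.items()):
        if target.startswith(prefix):
            return fd
    return None


if __name__ == "__main__":
    import sys

    pid = int(sys.argv[1])  # e.g. the output of `pidof pbs_mom`
    links = read_fd_links(pid)
    # Assumed log directory, taken from the lsof output above.
    if find_log_fd(links, "/opt/spool/torque/mom_logs") is None:
        print(f"pid {pid}: no descriptor points into mom_logs")
        for fd, target in sorted(links.items()):
            print(f"  fd {fd}: {classify(target)} {target}")
```

Run as root with the pid of pbs_mom as the only argument. On a healthy node it
prints nothing; on a node in the state shown above it lists every descriptor
with its kind.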

pbs_mom then begins to fill the system logs with this message:

--------------------------------------------
Oct 17 09:37:05 inode18 pbs_mom: Broken pipe (32) in log_record, PBS cannot write to its log
Oct 17 09:37:05 inode18 pbs_mom: Broken pipe (32) in log_record, PBS cannot write to its log
[...]
--------------------------------------------
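
The "Broken pipe (32)" text is what one would expect from a write to a
descriptor that refers to a socket whose peer has gone away, which fits fd 3
now being a socket. The sketch below only reproduces that errno in isolation;
it is not torque code and makes no claim about why the descriptor was replaced.
The function name is hypothetical.

```python
import errno
import socket


def write_to_disconnected_socket() -> int:
    """Send on a stream socket whose peer has closed; return the errno seen."""
    left, right = socket.socketpair()
    right.close()  # the peer goes away
    try:
        left.send(b"log line\n")
    except OSError as exc:
        # Python ignores SIGPIPE by default, so the failure surfaces here.
        return exc.errno
    finally:
        left.close()
    return 0


if __name__ == "__main__":
    err = write_to_disconnected_socket()
    print(err, errno.errorcode.get(err))
```

On Linux the returned value is EPIPE, the same error that log_record reports.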

The last readable lines of the pbs_mom log file look like this:

--------------------------------------------
10/17/2007 09:37:02;0008;   pbs_mom;Job;94896.poseidon.caspur.it;received request 'JOIN_JOB' from 192.168.201.41:1023
10/17/2007 09:37:02;0008;   pbs_mom;Job;94896.poseidon.caspur.it;im_request: JOIN_JOB 94896.poseidon.caspur.it node 1
10/17/2007 09:37:02;0001;   pbs_mom;Job;job_nodes;job: 94896.poseidon.caspur.it numnodes=8 numvnod=16
--------------------------------------------

The remaining lines contain binary data, and the log file stays corrupted
until pbs_mom rotates it at midnight.

Does anybody have any idea?

Thanks in advance
Ale

-- 
  Alessandro Federico
  CASPUR     http://www.caspur.it/
  e-mail:    alessandro.federico at caspur.it
  phone:     +39 06 44486708
  fax:       +39 06 4957083
------------------------------------------
  Military intelligence is a contradiction
  in terms.                 (Groucho Marx)
------------------------------------------

