[torqueusers] torque-2.1.6 - pbs_mom cannot write to its log
Alessandro Federico
alessandro.federico at caspur.it
Wed Oct 17 03:49:01 MDT 2007
Hi all.
I'm running torque-2.1.6 on SLES10 x86_64 (2.6.16.27-0.9-smp).
Sometimes I observe this strange behavior:
1) Before a node starts/joins the first job of the day,
the file descriptor of the log file is correct:
--------------------------------------------
# lsof -p `pidof pbs_mom` | grep mom_logs
pbs_mom 7541 root 3w REG 8,1 208319 126550 /opt/spool/torque/mom_logs/20071017
--------------------------------------------
2) After the node starts/joins the first job of the day,
the log file descriptor (fd 3) no longer points to the log file but to a socket:
--------------------------------------------
# lsof -p `pidof pbs_mom`
[...]
pbs_mom 7541 root 3w sock 0,5 1244251 can't identify protocol
[...]
# ll /proc/`pidof pbs_mom`/fd
total 11
lr-x------ 1 root root 64 2007-10-15 14:39 0 -> /dev/null
l-wx------ 1 root root 64 2007-10-15 14:39 1 -> /dev/null
lrwx------ 1 root root 64 2007-10-16 11:06 11 -> socket:[1244252]
l-wx------ 1 root root 64 2007-10-15 14:39 2 -> /dev/null
l-wx------ 1 root root 64 2007-10-15 14:39 3 -> socket:[1244251]
l-wx------ 1 root root 64 2007-10-15 14:39 4 -> /opt/spool/torque/mom_priv/mom.lock
lrwx------ 1 root root 64 2007-10-15 14:39 5 -> socket:[679046]
lrwx------ 1 root root 64 2007-10-15 14:39 6 -> socket:[679047]
lrwx------ 1 root root 64 2007-10-15 14:39 7 -> socket:[679050]
lrwx------ 1 root root 64 2007-10-15 14:39 8 -> socket:[679051]
lr-x------ 1 root root 64 2007-10-15 14:39 9 -> /proc
--------------------------------------------
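To catch the moment the descriptor changes, the manual lsof and /proc checks above could be wrapped in a small script. This is only a minimal sketch: it assumes Linux /proc, and the helper names fd_target and fd_in_dir are hypothetical, not part of TORQUE.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of TORQUE.

# fd_target PID FD: print what the descriptor currently refers to,
# e.g. a path or "socket:[1244251]". Relies on Linux /proc.
fd_target() {
    readlink "/proc/$1/fd/$2"
}

# fd_in_dir PID FD DIR: succeed only if the descriptor still points
# at a file under DIR.
fd_in_dir() {
    case "$(fd_target "$1" "$2")" in
        "$3"/*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

With the paths from this node, it could be called from cron as:
fd_in_dir "$(pidof pbs_mom)" 3 /opt/spool/torque/mom_logs || echo "fd 3 is no longer the log file"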
pbs_mom then begins to flood syslog with this message:
--------------------------------------------
Oct 17 09:37:05 inode18 pbs_mom: Broken pipe (32) in log_record, PBS cannot write to its log
Oct 17 09:37:05 inode18 pbs_mom: Broken pipe (32) in log_record, PBS cannot write to its log
[...]
--------------------------------------------
The last readable lines of the pbs_mom log file look like this:
--------------------------------------------
10/17/2007 09:37:02;0008; pbs_mom;Job;94896.poseidon.caspur.it;received request 'JOIN_JOB' from 192.168.201.41:1023
10/17/2007 09:37:02;0008; pbs_mom;Job;94896.poseidon.caspur.it;im_request: JOIN_JOB 94896.poseidon.caspur.it node 1
10/17/2007 09:37:02;0001; pbs_mom;Job;job_nodes;job: 94896.poseidon.caspur.it numnodes=8 numvnod=16
--------------------------------------------
The remaining lines contain binary data, and the log file stays corrupted
until pbs_mom rotates it at midnight.
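To see exactly where the readable part of a log ends, something like the following could be used. It is a rough sketch that assumes GNU grep (for -a and -m); first_binary_line is a hypothetical helper name. It would also flag legitimate non-ASCII text, so the result is only a hint.

```shell
#!/bin/sh
# first_binary_line FILE: print the number of the first line holding a
# byte that is neither printable ASCII nor whitespace; print nothing if
# the file is clean. -a and -m are GNU grep options.
first_binary_line() {
    LC_ALL=C grep -a -n -m 1 '[^[:print:][:space:]]' "$1" | cut -d: -f1
}
```

For example: first_binary_line /opt/spool/torque/mom_logs/20071017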
Does anybody have any idea?
Thanks in advance
Ale
--
Alessandro Federico
CASPUR http://www.caspur.it/
e-mail: alessandro.federico at caspur.it
phone: +39 06 44486708
fax: +39 06 4957083
------------------------------------------
Military intelligence is a contradiction
in terms. (Groucho Marx)
------------------------------------------