[torquedev] MOM interrupted system call and communication problems
Neil Hodgson
neil.hodgson at sirca.org.au
Thu Mar 5 15:35:42 MST 2009
A test installation of TORQUE PBS 2.3.6 is logging several error messages. I
have traced one of them to its cause and have a fix, but have not yet worked
out the others. The errors occur readily when a large number of small test
jobs are submitted, although some have appeared at other times.
1) pbs_mom: Interrupted system call (4) in conf_res, pipe read
2) pbs_server: stream_eof, connection to castle-dev101.sirca.org.au is bad,
remote service may be down, message may be corrupt, or connection may have been
dropped remotely (End of File). setting node state to down
3) pbs_mom: Broken pipe (32) in rm_request, write request response failed:
Protocol failure in commit message refused from port 1021 addr 10.255.255.200
Message 1 is caused by the MOM receiving a signal while running a shell
command. The mom_priv/config file contains a shell item that uses awk to format
/proc/loadavg. This is run using popen (in mom_main.c:conf_res) and the result
is read from the pipe. Running awk takes a little time, and a signal may arrive
in the meantime (I'm guessing a SIGCHLD from a completing job), which causes
the read to fail with errno EINTR. conf_res then returns with an error, although
it could simply continue reading, as another use of popen in the same file (in
reqvarattr) does. A patch that fixes the problem is:
--- src/resmom/mom_main.c (revision 14133)
+++ src/resmom/mom_main.c (working copy)
@@ -4014,6 +4014,8 @@
   child_len = 0;
   child_spot[0] = '\0';
+retryread:
+
   while ((len = read(fd, child_spot, ret_size - child_len)) > 0)
     {
     for (i = 0;i < len;i++)
@@ -4040,6 +4042,9 @@
   if (len == -1)
     {
+    if (errno == EINTR)
+      goto retryread;
+
     log_err(errno, id, "pipe read");
     sprintf(ret_string, "? %d",
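
For anyone who wants to see the retry idea outside of mom_main.c, here is a
minimal standalone sketch. It is not TORQUE code: read_all_retry and
run_command are hypothetical names made up for illustration, and the buffer
handling is simplified compared with conf_res. It only shows the pattern of
treating EINTR from read() as "try again" rather than as a failure.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not part of TORQUE): read from fd until EOF or until
 * cap-1 bytes are stored, retrying when read() is interrupted by a signal.
 * Returns the number of bytes stored, or -1 on a genuine error. */
static ssize_t read_all_retry(int fd, char *buf, size_t cap)
{
  size_t used = 0;

  if (cap == 0)
    return -1;

  while (used < cap - 1) {
    ssize_t len = read(fd, buf + used, cap - 1 - used);

    if (len > 0)
      used += (size_t)len;   /* got data, keep going */
    else if (len == 0)
      break;                 /* EOF: the child closed its end of the pipe */
    else if (errno != EINTR)
      return -1;             /* real failure, report it */
    /* errno == EINTR: a signal arrived mid-read, so just loop and retry */
  }

  buf[used] = '\0';
  return (ssize_t)used;
}

/* Hypothetical wrapper: run a shell command through popen() and capture
 * its output in out. Returns 0 on success, -1 on failure. */
static int run_command(const char *cmd, char *out, size_t cap)
{
  FILE *child = popen(cmd, "r");
  ssize_t n;

  if (child == NULL)
    return -1;

  n = read_all_retry(fileno(child), out, cap);
  pclose(child);

  return (n < 0) ? -1 : 0;
}
```

Calling run_command("echo hello", buf, sizeof(buf)) should leave "hello\n" in
buf whether or not a SIGCHLD arrives while the read is in progress.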
It is possible that the other messages are also caused by signal issues.
Message 3 could point to a signal occurring within the sendto call inside
rpp.c:rpc_send_out, although there isn't much evidence for this yet. These
issues probably occur much more often when there are many short jobs, since
there is more opportunity for collisions between SIGCHLD signals and other
code, but they should also occur under lower load.
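
If the sendto theory turns out to be right, the same retry pattern would
apply there. The following is only a sketch under that assumption:
sendto_retry is a hypothetical wrapper, not something taken from rpp.c, and I
have not confirmed that EINTR is what actually produces message 3.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical wrapper (not from rpp.c): call sendto() again whenever it
 * fails only because a signal interrupted it. Any other error, or a
 * successful send, is returned to the caller unchanged. */
static ssize_t sendto_retry(int fd, const void *buf, size_t len, int flags,
                            const struct sockaddr *addr, socklen_t addrlen)
{
  ssize_t sent;

  do {
    sent = sendto(fd, buf, len, flags, addr, addrlen);
  } while (sent == -1 && errno == EINTR);

  return sent;
}
```

The wrapper takes the same arguments as sendto, so it could be dropped in at
a call site without other changes.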
Neil