[torquedev] MOM interrupted system call and communication problems

Neil Hodgson neil.hodgson at sirca.org.au
Thu Mar 5 15:35:42 MST 2009


    A test installation of TORQUE PBS 2.3.6 is logging some error messages. I 
have traced one message to its cause and have a fix, but haven't worked out 
the others yet. The errors occur readily when a large number of small test 
jobs are submitted, although some have appeared at other times.

1) pbs_mom: Interrupted system call (4) in conf_res, pipe read

2) pbs_server: stream_eof, connection to castle-dev101.sirca.org.au is bad, 
remote service may be down, message may be corrupt, or connection may have been 
dropped remotely (End of File).  setting node state to down

3) pbs_mom: Broken pipe (32) in rm_request, write request response failed: 
Protocol failure in commit message refused from port 1021 addr 10.255.255.200

    Message 1 is caused by the MOM receiving a signal while running a shell 
command. The mom_priv/config file contains a shell item that uses awk to format 
/proc/loadavg. This is run using popen (in mom_main.c:conf_res) and the result 
is read from the pipe. Running awk takes a little time, so a signal may arrive 
during the read (I'm guessing a SIGCHLD from a completing job), which makes the 
read fail with errno EINTR. conf_res then returns with an error, although it 
could simply continue reading, as another use of popen in the same file (in 
reqvarattr) already does. A patch that fixes the problem is:

--- src/resmom/mom_main.c       (revision 14133)
+++ src/resmom/mom_main.c       (working copy)
@@ -4014,6 +4014,8 @@
    child_len = 0;
    child_spot[0] = '\0';

+retryread:
+
    while ((len = read(fd, child_spot, ret_size - child_len)) > 0)
      {
      for (i = 0;i < len;i++)
@@ -4040,6 +4042,9 @@

    if (len == -1)
      {
+    if (errno == EINTR)
+      goto retryread;
+
      log_err(errno, id, "pipe read");

      sprintf(ret_string, "? %d",
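
    The same retry idea can be shown outside mom_main.c. The following is only 
a minimal standalone sketch, not the TORQUE code: read_retry_eintr and 
capture_command are hypothetical names made up for illustration, and the buffer 
handling is simplified compared with conf_res. It reads the output of a popen'd 
command and treats EINTR as "try again" rather than as a failure.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not part of TORQUE): read from fd until EOF or until
 * buf holds cap-1 bytes, retrying when read() is interrupted by a signal.
 * Returns the number of bytes stored, or -1 on a genuine error. */
static ssize_t read_retry_eintr(int fd, char *buf, size_t cap)
  {
  size_t used = 0;

  if (buf == NULL || cap == 0)
    return -1;

  while (used < cap - 1)
    {
    ssize_t len = read(fd, buf + used, cap - 1 - used);

    if (len > 0)
      used += (size_t)len;   /* got data: keep reading */
    else if (len == 0)
      break;                 /* EOF: the child closed its end of the pipe */
    else if (errno == EINTR)
      continue;              /* interrupted by a signal: retry the read */
    else
      return -1;             /* any other errno is a real failure */
    }

  buf[used] = '\0';
  return (ssize_t)used;
  }

/* Hypothetical wrapper: run a shell command with popen and capture its
 * output through the retrying reader above. */
static ssize_t capture_command(const char *cmd, char *buf, size_t cap)
  {
  FILE *child = popen(cmd, "r");
  ssize_t n;

  if (child == NULL)
    return -1;

  n = read_retry_eintr(fileno(child), buf, cap);
  pclose(child);
  return n;
  }
```

    Calling capture_command("awk '{print $1}' /proc/loadavg", buf, sizeof(buf)) 
would be roughly comparable to the shell item described above; the exact awk 
expression used in mom_priv/config is not shown here, so that command is an 
assumption.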

    It is possible that the other messages may also be caused by signal issues. 
Message 3 could point to a signal occurring within the sendto call inside 
rpp.c:rpp_send_out, although there isn't much evidence for this yet. These issues 
may occur much more often when there are many short jobs, since there is more 
opportunity for collisions between SIGCHLD signals and other code, but they 
should also occur under lower load.
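
    If an interrupted sendto did turn out to be involved, the usual mitigation 
would look much like the read fix. The sketch below is an assumption about what 
such a change could look like, not code from rpp.c; sendto_retry_eintr is a 
hypothetical name. Note that it only covers EINTR (4), whereas message 3 
reports Broken pipe (32), so this alone would not confirm or resolve that 
message.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical wrapper (not TORQUE code): call sendto() again whenever it
 * fails with EINTR. Any other failure is returned to the caller unchanged,
 * with errno still set by sendto(). */
static ssize_t sendto_retry_eintr(int sock, const void *buf, size_t len,
                                  int flags, const struct sockaddr *to,
                                  socklen_t tolen)
  {
  ssize_t sent;

  do
    {
    sent = sendto(sock, buf, len, flags, to, tolen);
    }
  while (sent == -1 && errno == EINTR);

  return sent;
  }
```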

    Neil

