[torqueusers] completed jobs not registering with pbs_server and other goodies

Edward Simmonds esimm at fnal.gov
Fri Mar 20 12:56:21 MDT 2009


Hi all.

Please bear with me if I say something foolish; I've been involved with this Torque/Maui stuff for, oh, about six days.

I'm going to give a few bullet points, for those who don't have time to read a long post, and then a longer explanation.

Here's the scenario:

We recently upgraded from maui-3.2.6p16/torque-2.2.1 to maui-3.2.6p21/torque-2.3.6.  Since then we've been having very 
serious performance problems seemingly related to problems connecting to the pbs_server.

A few issues:

1. If pbs_mom fails to notify the pbs_server that a job has terminated (via post_epilogue()), it never tries again. 
This results in an inconsistent state where the pbs_server believes that jobs which have already terminated are still 
running.  We can confirm this by examining the pbs_mom logs and finding where they failed to notify the pbs_server 
about a job completing.  Running "pbsnodes {node}" on the server still shows the jobs as running.  The message in the 
mom log looks like this:

03/19/2009 13:15:07;0001;   pbs_mom;Svr;pbs_mom;Operation now in progress (115) in post_epilogue, cannot bind to port 
1023 in client_to_svr - connection refused


2. Even more strangely, we find messages in the maui.log on the head node showing a job trying to start on one of these 
worker nodes, which pbs indicates are 'job-exclusive'.  Those job startups fail with:

03/20/2009 00:01:33;0080;PBS_Server;Req;req_reject;Reject reply code=15044(Resource temporarily unavailable 
REJHOST=worker186 MSG=cannot allocate node 'worker186' to job - node not currently available (nps needed/free: 1/0, 
joblist: 123281.cab1.fnal.gov:0,127446.cab1.fnal.gov:1)), aux=0, type=RunJob, from root at cab1.fnal.gov


3. We also have a problem where maui freezes for 15 minutes trying to query the pbs_server process.  During this time, 
no jobs are scheduled.  I have seen this reported on the lists, but the only fixes I have found are an unofficial 
patch against 3.2.6p20 that someone proposed and a change to a maui parameter (which we've already made).


So my questions are:

Why is pbs_server getting so busy it won't allow connections?

How do I get the pbs_mom and pbs_server to synchronize jobs after the mom fails to notify the server of a job completion?

Why is maui doing job starts on nodes that pbs_server, incorrectly, thinks are job-exclusive?

How do I stop maui from freezing for 15 minutes at a time?


Here's the longer analysis for those who haven't fallen asleep:

If we look at "pbsnodes {node}", we'll see jobs listed as running that the worker node's mom log shows as completed. 
I traced this through the code, and it appears to be an issue with the post_epilogue() procedure.  That procedure 
attempts to open a connection to the pbs_server and notify it that a job has finished (the code refers to this as an 
"obit").  If that connection fails, the pbs_mom process never makes another attempt to notify the server that the job 
has completed.

The relevant portion of the code is:

Source file:  resmom/catch_child.c, post_epilogue() procedure


   /* open new connection */

   sock = mom_open_socket_to_jobs_server(pjob, id, obit_reply);

   if (sock < 0)
     {
     /* FAILURE */

     if ((errno == EINTR) || (errno == ETIMEDOUT) || (errno == EINPROGRESS))
       {
       /* transient failure - server/network up but busy... retry */

       int retrycount;

       for (retrycount = 0;retrycount < 2;retrycount++)
         {
         sock = mom_open_socket_to_jobs_server(pjob, id, obit_reply);

         if (sock >= 0)
           break;
         }  /* END for (retrycount) */
       }

     if (sock < 0)
       {
       /* We are trying to send obit, but failed - where is this retried?
        * Answer: I think that the main_loop should examine jobs and try
        * every so often to send the obit.  This would work for recovered
        * jobs also.
        */

       return(1);
       }
     }

Please note that the comment at the end recognizes that a failed connection will require that the obit be attempted 
later, but I cannot see anywhere in the code that this is being done.  There is just a little snippet that hints that 
someone was thinking about this in mom_main.c:

   int        MOMRetryObit   = 0;  /* NOTE:  change to TRUE is 2.4 */


So it looks like this is being planned for 2.4, but not working in 2.3.6.
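
To make the idea in that comment concrete, here is a rough sketch of what a periodic obit retry from the main loop 
might look like.  This is not Torque code: the sketch_job struct, retry_pending_obits(), the obit_sender callback, the 
two stand-in senders and the 30-second interval are all made up for illustration, and the real job structure and 
obit path in pbs_mom are considerably more involved.

```c
#include <time.h>

/* Seconds to wait before trying a failed obit again (arbitrary value for this sketch). */
#define OBIT_RETRY_INTERVAL 30

/* Hypothetical, heavily simplified stand-in for the mom's per-job record. */
struct sketch_job {
    const char *id;          /* job identifier */
    int         obit_pending; /* 1 = job has finished but the server has not been told */
    time_t      next_attempt; /* earliest time at which the obit may be tried again */
};

/* Hypothetical callback: returns 0 if the server accepted the obit, nonzero otherwise. */
typedef int (*obit_sender)(const struct sketch_job *job);

/*
 * Intended to be called once per pass of the main loop.
 * Walks the job list and retries every obit that is still pending and due.
 * Returns the number of obits delivered during this pass.
 */
int retry_pending_obits(struct sketch_job *jobs, int njobs, time_t now, obit_sender send)
{
    int delivered = 0;

    for (int i = 0; i < njobs; i++) {
        struct sketch_job *job = &jobs[i];

        if (!job->obit_pending || now < job->next_attempt)
            continue;                       /* nothing to do, or not due yet */

        if (send(job) == 0) {
            job->obit_pending = 0;          /* server now knows the job is done */
            delivered++;
        } else {
            /* server still unreachable: keep the obit and back off */
            job->next_attempt = now + OBIT_RETRY_INTERVAL;
        }
    }

    return delivered;
}

/* Stand-in senders so the sketch can be exercised without a server. */
int send_always_fails(const struct sketch_job *job)    { (void)job; return -1; }
int send_always_succeeds(const struct sketch_job *job) { (void)job; return 0; }
```

The point of the sketch is only that a failed obit stays recorded against the job and gets picked up again on a 
later pass, instead of being dropped when the function that first tried to send it returns.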


Similarly, maui on the head node freezes while trying to run a pbs_disconnect() (after timing out while requesting 
info from the pbs server).  There are various messages about this problem floating around, but I still haven't seen a 
solution.  We've already set RMCFG[base] TIMEOUT=120, but that doesn't seem to help.

It looks to me like the main problem here is that pbs_server is not accepting connections for some reason.  That seems 
to cause these other problems.

I apologize for the length of this post...


Thanks in advance!


Ed

