Bug 34 - torque 2.4.X breaks OSC's mpiexec (pbs_statjob broken?)
Status: RESOLVED FIXED
Product: TORQUE
Component: libtorque
Version: 2.4.x
Platform: All All
Importance: P1 critical
Assigned To: Joshua Bernstein
Reported: 2009-11-24 21:43 MST by Glen
Modified: 2009-12-03 19:07 MST (History)

Description Glen 2009-11-24 21:43:39 MST
OSC's mpiexec calls pbs_statjob to get the exec_host list. Starting with
TORQUE 2.4, mpiexec is unable to get the exec_host attribute for the job, and
it produces the following error message:

mpiexec: Error: get_hosts: pbs_statjob did not return "exec_host" info.


OSC's mpiexec works as expected in TORQUE 2.3.x.
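
For context, the check that fails can be sketched roughly as follows. This is a
minimal, self-contained illustration and not mpiexec's actual source:
struct attr_node is a simplified stand-in for the attribute list (struct attrl)
that pbs_statjob returns with each job's status, and find_attr_value is a
hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the attribute list returned by pbs_statjob;
 * only the fields needed for this illustration are modeled. */
struct attr_node {
    struct attr_node *next;
    const char *name;
    const char *value;
};

/* Hypothetical helper: return the value of the named attribute,
 * or NULL if the reply did not include that attribute. */
static const char *find_attr_value(const struct attr_node *list,
                                   const char *name)
{
    assert(name != NULL);
    for (; list != NULL; list = list->next) {
        if (list->name != NULL && strcmp(list->name, name) == 0)
            return list->value;
    }
    return NULL;
}
```

When the reply carries no "exec_host" entry, find_attr_value(list, "exec_host")
returns NULL, which corresponds to the condition behind the error message above.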
Comment 1 Joshua Bernstein 2009-11-30 14:18:56 MST
(In reply to comment #0)
> OSC's mpiexec makes a call to pbs_statjob to get the exec_host list.  Starting
> in TORQUE 2.4 mpiexec is unable to get the exec_host attribute for the job, and
> it  produces the following error message:
> 
> mpiexec: Error: get_hosts: pbs_statjob did not return "exec_host" info.
> 
> 
> OSC mpiexec works as expected in torque 2.3.x

I'll have a look at this.
Comment 2 Glen 2009-11-30 21:35:46 MST
assigning
Comment 3 Glen 2009-12-02 18:14:34 MST
Apparently Adaptive Computing developers have fixed the problem and checked
the change in to the Subversion 2.4-fixes branch today. I do not know whether
the fix has been merged into trunk yet.

I asked that they keep community developers in the loop.
Comment 4 Joshua Bernstein 2009-12-03 13:00:04 MST
(In reply to comment #3)
> Apparently Adaptive Computing developers have fixed the problem and checked in
> the change to subversion 2.4-fixes branch today. I do not know if the fix has
> been merged into trunk yet. 
> 
> I asked that they keep community developers in the loop.

Really? I wish they had posted that here.
Comment 5 Denis Charland 2009-12-03 13:24:40 MST
A patch for 2.4.2 would be greatly appreciated.
Comment 6 Ken Nielson 2009-12-03 13:40:47 MST
A diff of stat_job.c is shown below:

@@ -233,34 +233,6 @@
     {
     /* client specified certain attributes */

-    if (pal->al_valln != 0)
-      {
-      /* HACK - report pal via high-throughput attr list */
-
-      for (;pal != NULL;pal = (svrattrl *)GET_NEXT(pal->al_link))
-        {
-        index = pal->al_valln;
-
-        if (((padef + index)->at_flags & priv) &&
-            !((padef + index)->at_flags & ATR_DFLAG_NOSTAT))
-          {
-          if (!(((padef + index)->at_flags & ATR_DFLAG_PRIVR) && (IsOwner == 0)))
-            {
-            (padef + index)->at_encode(
-              pattr + index,
-              phead,
-              (padef + index)->at_name,
-              NULL,
-              ATR_ENCODE_CLIENT);
-            }
-          }
-        }    /* END for (pal) */
-
-      /* SUCCESS */
-
-      return(0);
-      }
-
     while (pal != NULL)
       {
       ++nth;

This code used pal->al_valln, the length of the attribute value, as an index
into the attribute table, which returned the wrong attribute. In the case of
mpiexec, exec_host was not found and the call failed. Removing this code solved
the problem.
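
To illustrate why indexing by value length picks the wrong entry, here is a
small self-contained sketch. It is not TORQUE code: the attribute table, its
ordering, struct request, and both lookup functions are hypothetical, and the
bounds check in lookup_by_valln exists only to keep the toy example safe.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy attribute table; entries and order are illustrative only. */
static const char *const attr_table[] = {
    "Job_Name", "Job_Owner", "job_state", "queue", "exec_host"
};
enum { ATTR_COUNT = sizeof(attr_table) / sizeof(attr_table[0]) };

/* One requested attribute, loosely modeled on svrattrl:
 * a name plus the length of its value string. */
struct request {
    const char *name;
    int valln;
};

/* Mistaken behavior: treat the value length as a table index. */
static const char *lookup_by_valln(const struct request *req)
{
    assert(req != NULL);
    if (req->valln < 0 || req->valln >= ATTR_COUNT)
        return NULL;
    return attr_table[req->valln];
}

/* Intended behavior: find the table entry whose name matches. */
static const char *lookup_by_name(const struct request *req)
{
    assert(req != NULL && req->name != NULL);
    for (int i = 0; i < ATTR_COUNT; i++) {
        if (strcmp(attr_table[i], req->name) == 0)
            return attr_table[i];
    }
    return NULL;
}
```

For a request naming "exec_host" with an assumed value length of 1,
lookup_by_valln returns "Job_Owner" while lookup_by_name returns "exec_host",
so the attribute the client asked for never appears in the reply.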

It's nice to see all of your feedback. We will do better at responding as we
hear from you.

Fixed in 2.4-fixes and trunk
Comment 7 Denis Charland 2009-12-03 19:07:54 MST
Ken, could you generate the patch using "diff -urN" so that it can be used with
the patch command?