Bugzilla – Bug 34
torque 2.4.X breaks OSC's mpiexec (pbs_statjob broken?)
Last modified: 2009-12-03 19:07:54 MST
OSC's mpiexec makes a call to pbs_statjob to get the exec_host list. Starting in TORQUE 2.4, mpiexec is unable to get the exec_host attribute for the job, and it produces the following error message:

mpiexec: Error: get_hosts: pbs_statjob did not return "exec_host" info.

OSC mpiexec works as expected in TORQUE 2.3.x.
(In reply to comment #0)
> OSC's mpiexec makes a call to pbs_statjob to get the exec_host list. Starting
> in TORQUE 2.4 mpiexec is unable to get the exec_host attribute for the job, and
> it produces the following error message:
>
> mpiexec: Error: get_hosts: pbs_statjob did not return "exec_host" info.
>
>
> OSC mpiexec works as expected in torque 2.3.x

I'll have a look at this.
assigning
Apparently Adaptive Computing developers have fixed the problem and checked the change in to the Subversion 2.4-fixes branch today. I do not know whether the fix has been merged into trunk yet. I asked that they keep community developers in the loop.
(In reply to comment #3)
> Apparently Adaptive Computing developers have fixed the problem and checked in
> the change to subversion 2.4-fixes branch today. I do not know if the fix has
> been merged into trunk yet.
>
> I asked that they keep community developers in the loop.

Really? Wish they would have posted as such in here.
A patch for 2.4.2 would be greatly appreciated.
A diff of stat_job.c is shown below:

@@ -233,34 +233,6 @@
 {
 /* client specified certain attributes */
-if (pal->al_valln != 0)
-  {
-  /* HACK - report pal via high-throughput attr list */
-
-  for (;pal != NULL;pal = (svrattrl *)GET_NEXT(pal->al_link))
-    {
-    index = pal->al_valln;
-
-    if (((padef + index)->at_flags & priv) &&
-        !((padef + index)->at_flags & ATR_DFLAG_NOSTAT))
-      {
-      if (!(((padef + index)->at_flags & ATR_DFLAG_PRIVR) && (IsOwner == 0)))
-        {
-        (padef + index)->at_encode(
-          pattr + index,
-          phead,
-          (padef + index)->at_name,
-          NULL,
-          ATR_ENCODE_CLIENT);
-        }
-      }
-    } /* END for (pal) */
-
-  /* SUCCESS */
-
-  return(0);
-  }
-
-
 while (pal != NULL)
   {
   ++nth;

This code used the value length stored in pal->al_valln as an index into the attribute table, which returned the wrong attribute. In the case of mpiexec, exec_host was not found and the call failed. Removing this code solved the problem.

It's nice to see all of your feedback. We will do better at responding as we hear from you.

Fixed in 2.4-fixes and trunk.
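To illustrate why indexing by al_valln returns the wrong attribute, here is a minimal, self-contained sketch. It does not use the real TORQUE structures: struct attr_def, struct req_attr, attr_table, lookup_by_valln and lookup_by_name are all hypothetical names made up for this example, and the table contents and the length value are illustrative only. The name-based lookup is an assumption about what the remaining while (pal != NULL) path does, not a copy of it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the server's attribute
 * definition table and the client's requested-attribute list.
 * These are NOT the real TORQUE types. */
struct attr_def {
    const char *name;
};

struct req_attr {
    const char *name;  /* attribute name the client asked for */
    int valln;         /* length of the value string, not a table index */
};

/* Illustrative table; the real table is larger and ordered differently. */
static const struct attr_def attr_table[] = {
    { "Job_Name" },
    { "Job_Owner" },
    { "job_state" },
    { "queue" },
    { "exec_host" },
};

#define ATTR_COUNT (sizeof(attr_table) / sizeof(attr_table[0]))

/* Pattern from the removed block: treat the value length as a table
 * index. The bounds check is added here only to keep the sketch safe. */
const char *lookup_by_valln(const struct req_attr *req)
{
    size_t index = (size_t)req->valln;

    if (req->valln < 0 || index >= ATTR_COUNT)
        return NULL;

    return attr_table[index].name;
}

/* Assumed behaviour of the remaining path: find the entry whose name
 * matches the requested attribute name. */
const char *lookup_by_name(const struct req_attr *req)
{
    size_t i;

    for (i = 0; i < ATTR_COUNT; i++) {
        if (strcmp(attr_table[i].name, req->name) == 0)
            return attr_table[i].name;
    }

    return NULL;
}
```

With a request such as { "exec_host", 1 }, lookup_by_valln selects whatever entry happens to sit at index 1 of the table, while lookup_by_name selects the exec_host entry. The result of the first function depends only on the value length, never on the name the client asked for.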
Ken, could you generate the patch using "diff -urN" so that it can be applied with the patch command?