[torquedev] epilogue job exit code

Gareth.Williams at csiro.au
Sun Jun 19 21:53:40 MDT 2011


> > > > Can anyone confirm that there is a bug in the job exit code (10th
> > > > argument to epilogue)?  I get the right Exit_status in the server and
> > > > accounting log, but in the epilogue I seem to get a particular number
> > > > regardless of the exit code and it seems to be my numeric uid (18686).
> > >
> > > if the epilogue is a shell script, the 10th argument must be enclosed
> > > in curly braces, ${10}.
> > > Otherwise you get the 1st argument (uid) with a zero appended.
> > >
> > > (Made this error myself and posted to the list a few months back.)
> >
> > Thanks but that's not my problem.  Curly braces give me 18686.  If you
> > run a job that has 'exit 3' does ${10} contain '3'? If so, what version
> > of torque?
>
> Version 2.4.12 does it right.

I've tested further and can see how changes introduced in the 3.0... torque version cause the issue I see.  I think it's exposing an existing bug.

The relevant change is the addition of ji_momport and ji_mom_rmport in pbs_job.h:
      struct   /* if in execution queue .. */
        {
        pbs_net_t ji_momaddr;  /* host addr of Server */
        unsigned short ji_momport;  /* host port of Server default 15002 */
        unsigned short ji_mom_rmport; /* host mom manager port of Server default 15003 */
        int       ji_exitstat; /* job exit status from MOM */
        } ji_exect;

This changes the layout of the memory relative to another struct in the union:
      struct
        {
        pbs_net_t ji_svraddr;  /* host addr of Server */
        int       ji_exitstat; /* job exit status from MOM */
        uid_t     ji_exuid;    /* execution uid */
        gid_t     ji_exgid;    /* execution gid */
        } ji_momt;

When run_pelog in prolog.c gets the exitstat:
      sprintf(exit_stat,"%d",
              pjob->ji_qs.ji_un.ji_exect.ji_exitstat);
it reads the value from ji_exect, but I think it should actually come from ji_momt.  In the old code the two definitions were 'compatible' (ji_exitstat sat at the same offset in both structs), so the bug was not apparent.  Now the two added port fields push ji_exect.ji_exitstat further along, so that it lines up with ji_momt.ji_exuid.  That is consistent with me seeing the numeric uid in the epilogue exit field.
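
To make the overlap concrete, here is a minimal standalone sketch of the two structs in a union.  It is not the real pbs_job.h: the union name ji_un_sketch and the helper functions are made up for illustration, and pbs_net_t is assumed to be an unsigned long.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>

/* Assumption: stand-in for the real pbs_net_t typedef. */
typedef unsigned long pbs_net_t;

/* Hypothetical cut-down copy of the two union members discussed above. */
union ji_un_sketch
  {
  struct
    {
    pbs_net_t      ji_momaddr;
    unsigned short ji_momport;     /* added in the newer layout */
    unsigned short ji_mom_rmport;  /* added in the newer layout */
    int            ji_exitstat;
    } ji_exect;

  struct
    {
    pbs_net_t ji_svraddr;
    int       ji_exitstat;
    uid_t     ji_exuid;
    gid_t     ji_exgid;
    } ji_momt;
  };

/* Byte offsets of the three fields of interest within the union. */
size_t exect_exitstat_offset(void)
{
    return offsetof(union ji_un_sketch, ji_exect.ji_exitstat);
}

size_t momt_exitstat_offset(void)
{
    return offsetof(union ji_un_sketch, ji_momt.ji_exitstat);
}

size_t momt_exuid_offset(void)
{
    return offsetof(union ji_un_sketch, ji_momt.ji_exuid);
}

/* Store an exit status and uid through ji_momt, then read the exit
 * status back through ji_exect, as run_pelog currently does. */
int read_exitstat_via_exect(int exitstat, uid_t uid)
{
    union ji_un_sketch u = {{0}};

    /* The demonstration only makes sense if both fields have the same size. */
    assert(sizeof(uid_t) == sizeof(int));

    u.ji_momt.ji_exitstat = exitstat;
    u.ji_momt.ji_exuid    = uid;
    return u.ji_exect.ji_exitstat;
}

/* Print the offsets so the overlap can be checked on a given platform. */
void print_layout(void)
{
    printf("ji_exect.ji_exitstat offset: %zu\n", exect_exitstat_offset());
    printf("ji_momt.ji_exitstat  offset: %zu\n", momt_exitstat_offset());
    printf("ji_momt.ji_exuid     offset: %zu\n", momt_exuid_offset());
}
```

Calling print_layout() from a small main() should show the first and third offsets matching on platforms where uid_t and int have the same size, and read_exitstat_via_exect(3, 18686) should then hand back the uid rather than 3.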

I think the right fix is to change the lines in prolog.c to be:
      sprintf(exit_stat,"%d",
              pjob->ji_qs.ji_un.ji_momt.ji_exitstat);
and leave pbs_job.h as is. Possibly the names should be changed to avoid the ambiguity.  Perhaps the ji_exect member should be 'ji_currentstat'.


regards,

Gareth

