[torquedev] epilogue job exit code
Gareth.Williams at csiro.au
Sun Jun 19 21:53:40 MDT 2011
> > > > Can anyone confirm that there is a bug in the job exit code (10th argument
> > > > to epilogue)? I get the right Exit_status in the server and accounting log, but
> > > > in the epilogue I seem to get a particular number regardless of the exit
> > > > code and it seems to be my numeric uid (18686).
> > >
> > > if the epilogue is a shell script, the 10th argument must be enclosed
> > > in curly parenthesis, ${10}.
> > > Otherwise you get the 1st argument (uid) with a zero appended.
> > >
> > > (Made this error myself and posted to the list a few months back.)
> >
> > Thanks but that's not my problem. Curly parenthesis give me 18686. If you
> > run a job that has 'exit 3' does ${10} contain '3'? If so, what version of
> > torque?
>
> Version 2.4.12 does it right.
I've tested further and can see how changes introduced in the 3.0... torque version cause the issue I am seeing. I think they expose an existing bug rather than introduce a new one.
The relevant change is the addition of ji_momport and ji_mom_rmport in pbs_job.h:
    struct /* if in execution queue .. */
    {
        pbs_net_t      ji_momaddr;    /* host addr of Server */
        unsigned short ji_momport;    /* host port of Server default 15002 */
        unsigned short ji_mom_rmport; /* host mom manager port of Server default 15003 */
        int            ji_exitstat;   /* job exit status from MOM */
    } ji_exect;

This changes the layout of the memory relative to another struct in the union:

    struct
    {
        pbs_net_t ji_svraddr;  /* host addr of Server */
        int       ji_exitstat; /* job exit status from MOM */
        uid_t     ji_exuid;    /* execution uid */
        gid_t     ji_exgid;    /* execution gid */
    } ji_momt;
When run_pelog in prolog.c gets the exitstat:

    sprintf(exit_stat,"%d",
            pjob->ji_qs.ji_un.ji_exect.ji_exitstat);

it does so from ji_exect, but I think it should actually come from ji_momt. In the old code the two definitions were 'compatible' (ji_exitstat immediately followed the address in both structs), so the bug was not apparent. Now the two unsigned short ports sit between the address and ji_exect.ji_exitstat, which pushes it along by what is typically 4 bytes, so it lines up with ji_momt.ji_exuid. That is consistent with me seeing the numeric uid in the epilogue exit field.
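To check the overlap outside of torque, a cut-down copy of the two structs can be compiled on its own. This is only a sketch: the union name ji_un_sketch and the helper functions are made up for illustration, pbs_net_t is assumed to be an unsigned long, and the real union in pbs_job.h has more members than the two shown here.

```c
#include <stddef.h>     /* offsetof, size_t */
#include <string.h>     /* memset */
#include <sys/types.h>  /* uid_t, gid_t */

/* Assumption: pbs_net_t is an unsigned long. */
typedef unsigned long pbs_net_t;

/* Cut-down, hypothetical copy of the union discussed above. */
union ji_un_sketch
{
    struct
    {
        pbs_net_t      ji_momaddr;
        unsigned short ji_momport;     /* added in 3.0... */
        unsigned short ji_mom_rmport;  /* added in 3.0... */
        int            ji_exitstat;
    } ji_exect;
    struct
    {
        pbs_net_t ji_svraddr;
        int       ji_exitstat;
        uid_t     ji_exuid;
        gid_t     ji_exgid;
    } ji_momt;
};

/* Byte offset of the exit status as seen through ji_exect. */
size_t exect_exitstat_offset(void)
{
    return offsetof(union ji_un_sketch, ji_exect.ji_exitstat);
}

/* Byte offset of the execution uid as seen through ji_momt. */
size_t momt_exuid_offset(void)
{
    return offsetof(union ji_un_sketch, ji_momt.ji_exuid);
}

/* Store status and uid through ji_momt, then read the status back
 * through ji_exect, which is the access pattern used by run_pelog. */
int read_via_exect(int exit_status, uid_t uid)
{
    union ji_un_sketch u;
    memset(&u, 0, sizeof u);
    u.ji_momt.ji_exitstat = exit_status;
    u.ji_momt.ji_exuid    = uid;
    return u.ji_exect.ji_exitstat;
}

/* Same stores, but read the status back through ji_momt. */
int read_via_momt(int exit_status, uid_t uid)
{
    union ji_un_sketch u;
    memset(&u, 0, sizeof u);
    u.ji_momt.ji_exitstat = exit_status;
    u.ji_momt.ji_exuid    = uid;
    return u.ji_momt.ji_exitstat;
}
```

On a typical Linux build I would expect the two offsets to match and read_via_exect(3, 18686) to hand back the uid rather than 3, while read_via_momt(3, 18686) returns 3. The exact offsets depend on the platform's sizes for short, int, long and uid_t.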
I think the right fix is to change the lines in prolog.c to:

    sprintf(exit_stat,"%d",
            pjob->ji_qs.ji_un.ji_momt.ji_exitstat);

and leave pbs_job.h as is. Possibly the names should be changed to avoid the ambiguity of having a ji_exitstat member in both structs. Perhaps the ji_exect member should be 'ji_currentstat'.
regards,
Gareth