[torqueusers] Job fails because prologue.user doesn't exist

Adam DeConinck ajdecon at ajdecon.org
Wed Jan 30 17:33:29 MST 2013


Hey all,

Running into an odd issue with Torque 4.1.0 and wondering if anyone has ideas.

I recently built a new image for our compute nodes, with the only
change being (AFAICT) installing the newest Mellanox OFED. The Torque
RPMs are the same in both images. However, nodes that boot into this
image fail to start any job scheduled on them, for reasons that don't
seem to be related to InfiniBand.

Instead, the logs seem to indicate that they are failing because of
the lack of a mom_priv/prologue.user file. This confuses me because
(a) we have never had such a file, and (b) if I go in and create empty
prologue.user and epilogue.user scripts, it still fails with the exact
same messages!

From mom_log:
01/30/2013 16:20:12;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 4.1.0, loglevel = 0
01/30/2013 16:22:48;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec failure, after files staged, no retry (see syslog for more information)
01/30/2013 16:22:48;0001;   pbs_mom;Job;7408.wwmaster.psg.cluster.zone;ALERT:  job failed phase 3 start
01/30/2013 16:22:48;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters for job 7408.wwmaster.psg.cluster.zone
01/30/2013 16:22:48;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::run_pelog, prolog/epilog failed, file: /var/spool/torque/mom_priv/epilogue.user, exit: 13, cannot stat
01/30/2013 16:22:48;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::run_epilogues, user epilog failed - interactive job
01/30/2013 16:22:48;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
01/30/2013 16:22:48;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
01/30/2013 16:22:48;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
01/30/2013 16:22:48;0080;   pbs_mom;Job;7408.wwmaster.psg.cluster.zone;obit sent to server

From syslog:
Jan 30 16:22:48 wm030 pbs_mom: LOG_ERROR::run_pelog, prolog/epilog failed, file: /var/spool/torque/mom_priv/prologue.user, exit: 13, cannot stat
Jan 30 16:22:48 wm030 pbs_mom: LOG_ERROR::handle_prologs, user prolog failed
Jan 30 16:22:48 wm030 pbs_mom: LOG_ERROR::run_pelog, prolog/epilog failed, file: /var/spool/torque/mom_priv/epilogue.user, exit: 13, cannot stat
Jan 30 16:22:48 wm030 pbs_mom: LOG_ERROR::run_epilogues, user epilog failed - interactive job
Jan 30 16:22:48 wm030 pbs_mom: LOG_ERROR::run_pelog, prolog/epilog failed, file: /var/spool/torque/mom_priv/epilogue.user, exit: 13, cannot stat
Jan 30 16:22:48 wm030 pbs_mom: LOG_ERROR::preobit_reply, user epilog failed - interactive job


It's worth noting that in the old image, the prologue.user and
epilogue.user messages are not even printed; the job just works.
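
For anyone who wants to compare the two images, here is a rough sketch of a
check that could be run on a node from each one. It stats every directory
leading to the two files named in the log messages and prints either the
mode string or the errno name of the failure. The function name check_path
is made up for illustration. The idea that "exit: 13" corresponds to errno
13 (EACCES on Linux) is only an assumption; nothing in the logs confirms
how Torque derives that number.

```python
import errno
import os
import stat


def check_path(path):
    """Stat every prefix of `path` and report what happens at each step.

    Returns a list of (prefix, result) tuples, where result is either a
    mode string such as 'drwxr-x--x' or the errno name of the failure.
    """
    path = os.path.abspath(path)
    results = []
    prefix = os.sep
    for part in path.split(os.sep):
        if not part:
            continue  # skip the empty piece before the leading separator
        prefix = os.path.join(prefix, part)
        try:
            st = os.stat(prefix)
            results.append((prefix, stat.filemode(st.st_mode)))
        except OSError as exc:
            # Record the symbolic errno name (e.g. ENOENT, EACCES).
            name = errno.errorcode.get(exc.errno, str(exc.errno))
            results.append((prefix, name))
    return results


if __name__ == "__main__":
    # Paths taken from the log messages above.
    for name in ("prologue.user", "epilogue.user"):
        target = "/var/spool/torque/mom_priv/" + name
        for prefix, result in check_path(target):
            print(f"{result:>12}  {prefix}")
```

Running it both as root and as the user who submitted the job, on the old
and the new image, should show whether a directory mode or ownership
differs between them. Which identity pbs_mom uses when it stats these
files is not something the logs reveal, so both are worth trying.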

Any ideas?

Thank you!
Adam

