[torqueusers] prologue not executed if submituser doesn't exist on MOM node
ramon.bastiaans at sara.nl
Thu Jun 18 03:05:26 MDT 2009
We noticed that a 'black hole' node can occur when a user doesn't exist on
a node in the cluster. This can be catastrophic if, for example, the node
loses its LDAP connection. The kicker is that the job disappears for the
user as if it was never submitted, with only a small error on the MOM:
pbs_mom;Svr;pbs_mom;start_exec, No Password Entry for User somebody
Only when submitting interactively do you see some sort of error/weirdness:
qsub: waiting for job 2277616.something to start
qsub: job 2277616.something apparently deleted
Non-interactive jobs just 'disappear', leaving users wondering what
happened to their jobs.
Now it seems the 'prologue' script is not executed before this
user/password entry error occurs. I wanted to check if the user id
existed on a node before start, from within the prologue. I realise
there is a "node health check" script, but that would not allow me to
perform this check very robustly.
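To make the check concrete, here is a minimal sketch of the lookup I have in mind, written in Python. It assumes the failing step on the MOM corresponds to an ordinary passwd lookup through the node's name service, which is my reading of the "No Password Entry" message and not something I have confirmed in the source. The names user_exists and main, and the convention of passing the user name as the first argument, are hypothetical and only there for illustration.

```python
import pwd
import sys


def user_exists(username):
    """Return True if the node's name service can resolve username."""
    try:
        # pwd.getpwnam raises KeyError when no passwd entry is found.
        pwd.getpwnam(username)
    except KeyError:
        return False
    return True


def main(argv):
    # Hypothetical calling convention: the user name is the first argument.
    if len(argv) != 2:
        sys.stderr.write("usage: check_user USERNAME\n")
        return 2
    if user_exists(argv[1]):
        return 0
    sys.stderr.write("no passwd entry for user %s\n" % argv[1])
    return 1
```

A wrapper script would finish with sys.exit(main(sys.argv)), so that a non-zero exit status signals the missing user to whatever calls it. Whether that caller can be the prologue is exactly the open question here.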
Am I correct in assuming that the prologue is not executed when the MOM
detects that the submit user does not exist? Is there any way to change
this behaviour?
R. Bastiaans, B.ICT :: Systems Programmer, HPC&V
SARA - Computing & Networking Services
Science Park 121 PO Box 94613
1098 XG Amsterdam NL 1090 GP Amsterdam NL
P.+31 (0)20 592 3000 F.+31 (0)20 668 3167