[torquedev] job commit TRANSICM bug
garrick at clusterresources.com
Mon Jul 24 15:03:57 MDT 2006
For the archives...
Late last week I found a node was rejecting all new jobs with "Execution
server rejected request REJHOST=xxxxxx MSG=cannot send job to xxxxxx,
state=PRERUN" in the server log, and this in the MOM log:
07/20/2006 22:03:49;0001; pbs_mom;Svr;pbs_mom;Success (0) in req_jobscript, job 1321953.hpc-pbs.usc.edu in unexpected state 'TRANSICM'
07/20/2006 22:03:49;0080; pbs_mom;Req;req_reject;Reject reply code=15004(Invalid request REJHOST=hpc0279.usc.edu MSG=job 1321953.hpc-pbs.usc.edu in unexpected state 'TRANSICM'), aux=0, type=JobScript, from PBS_Server at hpc-pbs.usc.edu
While it rejected roughly a thousand jobs over an hour or so, the jobid in
the MOM log was always the same, whereas the server log showed a new jobid
for each rejection.
My understanding of the bug is this:
1) server sends a job to MOM, sends the jobscript, sends a
readytocommit, and the final commit. Somewhere between the two commit
steps, a failure happened. pbs_server purged its copy of the job and
forgot about it. MOM still has the job in TRANSICM substate in its "new
job list".
2) server sends a new job to MOM, sends the jobscript, MOM confuses
this jobscript request with the failed job, finds it in the wrong state,
and rejects it.
Turns out that the jobid is not in the jobscript request! MOM was
matching the request with job structs in the "new job list" by socket
number! But the second job was re-using the previous socket number!
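To make the failure concrete, here is a minimal sketch of a socket-only lookup. It is not TORQUE's actual code: the struct, the enum values, and both function names are hypothetical stand-ins, and I'm assuming the jobscript handler only accepts a job that is still in an "incoming" substate (called NJ_TRANSIN here).

```c
#include <assert.h>
#include <stddef.h>

#define NJ_JOBID_LEN 64

/* Hypothetical substates; only the TRANSICM name comes from the MOM log. */
enum nj_state { NJ_FREE = 0, NJ_TRANSIN, NJ_TRANSICM };

/* Hypothetical, simplified entry in MOM's "new job list". */
struct new_job {
    char jobid[NJ_JOBID_LEN];
    int sock;               /* socket the job was received on */
    enum nj_state state;
};

/* Return the first in-use entry whose socket number matches. */
struct new_job *find_by_socket(struct new_job *list, size_t n, int sock)
{
    assert(list != NULL);
    for (size_t i = 0; i < n; i++) {
        if (list[i].state != NJ_FREE && list[i].sock == sock)
            return &list[i];
    }
    return NULL;
}

/* Return 0 if a jobscript request on `sock` is accepted, -1 if rejected. */
int jobscript_by_socket(struct new_job *list, size_t n, int sock)
{
    struct new_job *j = find_by_socket(list, n, sock);

    if (j == NULL)
        return -1;          /* no pending job on this socket */
    if (j->state != NJ_TRANSIN)
        return -1;          /* "job ... in unexpected state" */
    return 0;
}
```

If a stale TRANSICM entry and a fresh entry share a socket number, the lookup returns the stale one first and the new job's script is rejected, which matches the same-jobid-every-time pattern in the MOM log.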
If you are still reading... I've checked in changes that add the jobid
to the jobscript request to prevent the confusion, but I think the
process of looking up these new jobs is still flawed.
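For illustration only, a lookup keyed on the jobid carried in the request might look roughly like the sketch below. This is not the checked-in change; the struct, enum, and function names are hypothetical, and the accepted substate (PJ_TRANSIN) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PJ_JOBID_LEN 64

/* Hypothetical substates; only the TRANSICM name comes from the MOM log. */
enum pj_state { PJ_FREE = 0, PJ_TRANSIN, PJ_TRANSICM };

/* Hypothetical, simplified entry in MOM's "new job list". */
struct pending_job {
    char jobid[PJ_JOBID_LEN];
    int sock;
    enum pj_state state;
};

/* Return the in-use entry whose jobid matches, ignoring the socket number. */
struct pending_job *find_by_jobid(struct pending_job *list, size_t n,
                                  const char *jobid)
{
    assert(list != NULL && jobid != NULL);
    for (size_t i = 0; i < n; i++) {
        if (list[i].state != PJ_FREE && strcmp(list[i].jobid, jobid) == 0)
            return &list[i];
    }
    return NULL;
}

/* Return 0 if a jobscript request naming `jobid` is accepted, -1 if not. */
int jobscript_by_jobid(struct pending_job *list, size_t n, const char *jobid)
{
    struct pending_job *j = find_by_jobid(list, n, jobid);

    if (j == NULL || j->state != PJ_TRANSIN)
        return -1;
    return 0;
}
```

With this kind of matching, a stale entry that happens to share a socket number no longer shadows the new job, though the stale entry itself still sits in the list.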
The other side of this is that there is no mechanism for clearing stale
new jobs from MOM after a commit failure.
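One possible shape for such a mechanism, purely as a sketch and not something that exists in TORQUE as far as this message goes, is an age-based sweep over the uncommitted entries. Everything here is hypothetical: the struct, its fields, the function name, and the idea of a fixed maximum age.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical, simplified entry in MOM's "new job list". */
struct staged_job {
    int in_use;        /* nonzero while the slot holds a job */
    int committed;     /* nonzero once the final commit succeeded */
    time_t received;   /* when the job first arrived */
};

/*
 * Free every in-use, uncommitted entry older than max_age seconds.
 * Returns the number of entries purged.
 */
size_t purge_stale(struct staged_job *list, size_t n, time_t now,
                   double max_age)
{
    size_t purged = 0;

    assert(list != NULL);
    for (size_t i = 0; i < n; i++) {
        if (list[i].in_use && !list[i].committed &&
            difftime(now, list[i].received) > max_age) {
            list[i].in_use = 0;
            purged++;
        }
    }
    return purged;
}
```

Calling something like this periodically would bound how long a half-committed job can linger, at the cost of having to pick a timeout that is longer than any legitimate commit.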