[torqueusers] Mismatched job_id in pbs_mom
Joshua Bernstein
jbernstein at penguincomputing.com
Wed Jun 23 16:44:02 MDT 2010
Hello Folks!
When testing a specific workload under TORQUE, we are running into an
issue where TORQUE (qstat) thinks there is still a job in the running
state (R). Logging into that node and running ps -ef, however, confirms
that no such job is running there. Further still, the application's
logfile confirms that the job in question exited cleanly.
If we peek inside the pbs_mom log for that node, we find that pbs_mom
has printed a nice message saying something about a bug. Notice the
BUG: line in the excerpt below:
20100621:06/21/2010 17:46:48;0001; pbs_mom;Job;TMomFinalizeJob3;job 994.scyld.localdomain started, pid = 73977
20100621:06/21/2010 17:46:49;0080; pbs_mom;Job;994.scyld.localdomain;scan_for_terminated: job 994.scyld.localdomain task 1 terminated, sid=73977
20100621:06/21/2010 17:46:50;0008; pbs_mom;Job;994.scyld.localdomain;job was terminated
20100621:06/21/2010 17:46:50;0001; pbs_mom;Job;951.scyld.localdomain;BUG: mismatched jobid in preobit_reply (994.scyld.localdomain != 951.scyld.localdomain)
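For anyone who wants to check their own pbs_mom logs for the same
symptom, here is a minimal sketch in Python that pulls the two job IDs
out of each BUG: line. The pattern is based only on the wording of the
excerpt above, and the names find_mismatches, MISMATCH_RE and SAMPLE are
made up for illustration; nothing here is part of TORQUE. Which of the
two IDs is the "expected" one is not something the log line says, so the
code simply returns them in the order they are printed.

```python
import re

# Pattern assumed from the BUG: line in the excerpt above. The two job IDs
# are captured in the order they appear inside the parentheses.
MISMATCH_RE = re.compile(
    r"BUG: mismatched jobid in preobit_reply\s*"
    r"\((?P<left>[^\s()]+)\s*!=\s*(?P<right>[^\s()]+)\)"
)

# Sample text copied from the excerpt, including the mail client's wrapping.
SAMPLE = """20100621:06/21/2010 17:46:50;0001;
pbs_mom;Job;951.scyld.localdomain;BUG: mismatched jobid in preobit_reply
(994.scyld.localdomain != 951.scyld.localdomain)
"""


def find_mismatches(log_text):
    """Return a list of (left_id, right_id) tuples, one per BUG: line."""
    # Wrapped log lines are joined by collapsing all whitespace runs
    # (including newlines) into single spaces before matching.
    flat = " ".join(log_text.split())
    return [(m.group("left"), m.group("right"))
            for m in MISMATCH_RE.finditer(flat)]


if __name__ == "__main__":
    for left, right in find_mismatches(SAMPLE):
        print(left, "!=", right)
```

To run it against a real log, read the file into a string and pass it to
find_mismatches(); the location of the pbs_mom log depends on how TORQUE
was installed.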
This particular job isn't doing anything MPI related, and is
single-threaded. I've been able to duplicate this behavior in TORQUE
2.3.10, and even in yesterday's SVN checkout of the 'trunk'.
Has anybody seen this, or have any idea what's going on here? I'm happy,
of course, to supply a patch if the problem becomes more of a problem.
-Joshua Bernstein
Manager of Software Development
Penguin Computing