[torqueusers] Mismatched job_id in pbs_mom

Joshua Bernstein jbernstein at penguincomputing.com
Wed Jun 23 16:44:02 MDT 2010


Hello Folks!

While testing a specific workload under TORQUE, we are running into an 
issue where TORQUE (qstat) thinks a job is still in the running state 
(R). Logging into that node and running ps -ef, however, confirms that 
no such job is running there. Furthermore, the application's logfile 
confirms that the job in question exited cleanly.

If we have a peek inside the pbs_mom log for that node, pbs_mom has 
logged a message that explicitly reports a bug. Notice the BUG: line 
in the excerpt below:


20100621:06/21/2010 17:46:48;0001;   pbs_mom;Job;TMomFinalizeJob3;job 994.scyld.localdomain started, pid = 73977
20100621:06/21/2010 17:46:49;0080;   pbs_mom;Job;994.scyld.localdomain;scan_for_terminated: job 994.scyld.localdomain task 1 terminated, sid=73977
20100621:06/21/2010 17:46:50;0008;   pbs_mom;Job;994.scyld.localdomain;job was terminated
20100621:06/21/2010 17:46:50;0001;   pbs_mom;Job;951.scyld.localdomain;BUG: mismatched jobid in preobit_reply (994.scyld.localdomain != 951.scyld.localdomain)
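
For anyone who wants to check their own pbs_mom logs for the same 
symptom, here is a minimal sketch in Python. It assumes only the message 
format visible in the excerpt above; the function name find_mismatches 
is made up for illustration and is not part of TORQUE. It does not try 
to interpret which of the two job ids is the "expected" one, it just 
reports the pair as logged.

```python
import re

# Matches the message text seen in the excerpt above, e.g.
#   BUG: mismatched jobid in preobit_reply (994.x != 951.x)
# The pattern is an assumption based on that single log line.
MISMATCH_RE = re.compile(
    r"BUG: mismatched jobid in preobit_reply\s*"
    r"\(([^\s()]+)\s*!=\s*([^\s()]+)\)"
)


def find_mismatches(log_text):
    """Return (left_id, right_id) pairs in order of appearance."""
    # Log lines may have been wrapped (as in this mail), so collapse
    # every run of whitespace, including newlines, to a single space.
    flat = " ".join(log_text.split())
    return MISMATCH_RE.findall(flat)


if __name__ == "__main__":
    import sys

    # Usage: python find_mismatches.py <path to a pbs_mom log file>
    with open(sys.argv[1]) as handle:
        for left, right in find_mismatches(handle.read()):
            print(left, "!=", right)
```

Collapsing the whitespace first means the same function works on a raw 
log file and on an excerpt that a mail client has re-wrapped.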

This particular job isn't doing anything MPI related, and is 
single-threaded. I've been able to reproduce this behavior in TORQUE 
2.3.10, and even in yesterday's SVN checkout of 'trunk'.

Has anybody seen this, or have any idea what's going on here? I'm happy, 
of course, to supply a patch once the problem is better understood.

-Joshua Bernstein
Manager of Software Development
Penguin Computing

