[torqueusers] Mismatched job_id in pbs_mom

David Beer dbeer at adaptivecomputing.com
Wed Jun 23 16:48:29 MDT 2010



----- Original Message -----
> Hello Folks!
> 
> When testing this specific workload within TORQUE we are running into an
> issue where TORQUE (qstat) thinks there is still a job in the running
> state (R), though logging into that node and running ps -ef confirms that
> no job is running there. Further still, the application's logfile
> confirms the job in question exited cleanly.
> 
> If we have a peek inside the pbs_mom log for that node, pbs_mom
> prints a message reporting a bug. Notice the BUG: line in the
> excerpt below:
> 
> 
> 20100621:06/21/2010 17:46:48;0001; pbs_mom;Job;TMomFinalizeJob3;job 994.scyld.localdomain started, pid = 73977
> 20100621:06/21/2010 17:46:49;0080; pbs_mom;Job;994.scyld.localdomain;scan_for_terminated: job 994.scyld.localdomain task 1 terminated, sid=73977
> 20100621:06/21/2010 17:46:50;0008; pbs_mom;Job;994.scyld.localdomain;job was terminated
> 20100621:06/21/2010 17:46:50;0001; pbs_mom;Job;951.scyld.localdomain;BUG: mismatched jobid in preobit_reply (994.scyld.localdomain != 951.scyld.localdomain)
> 
> This particular job isn't doing anything MPI related, and is single
> threaded. I've been able to duplicate this behavior in TORQUE 2.3.10,
> and even in the SVN checkout of the 'trunk' from yesterday.
> 
> Has anybody seen this, or does anybody have any idea what's going on
> here? I'm happy of course to supply a patch when the problem becomes
> more problematic.

I haven't seen this behavior, but I'd be happy to try to reproduce it. Do you have some suggestions for reproducing it?

> 
> -Joshua Bernstein
> Manager of Software Development
> Penguin Computing

-- 
David Beer | Senior Software Engineer
Adaptive Computing

