[torqueusers] Mismatched job_id in pbs_mom
dbeer at adaptivecomputing.com
Wed Jun 23 16:48:29 MDT 2010
----- Original Message -----
> Hello Folks!
> When testing a specific workload within TORQUE, we are running into an
> issue where TORQUE (qstat) thinks there is still a job in the running
> state (R), though logging into that node and running ps -ef confirms
> that no job is actually running there. Further still, the application's
> logfile confirms the job in question exited cleanly.
> If we have a peek inside the pbs_mom log for that node, pbs_mom
> prints out a message reporting a bug. Notice the
> BUG: line in the excerpt below:
> 20100621:06/21/2010 17:46:48;0001; pbs_mom;Job;TMomFinalizeJob3;job 994.scyld.localdomain started, pid = 73977
> 20100621:06/21/2010 17:46:49;0080; pbs_mom;Job;994.scyld.localdomain;scan_for_terminated: job 994.scyld.localdomain task 1 terminated, sid=73977
> 20100621:06/21/2010 17:46:50;0008; pbs_mom;Job;994.scyld.localdomain;job was terminated
> 20100621:06/21/2010 17:46:50;0001; pbs_mom;Job;951.scyld.localdomain;BUG: mismatched jobid in (994.scyld.localdomain != 951.scyld.localdomain)
> This particular job isn't doing anything MPI related, and is single
> threaded. I've been able to duplicate this behavior in TORQUE 2.3.10,
> and even in the SVN checkout of the 'trunk' from yesterday.
> Has anybody seen this, or have any idea what's going on here? I'm happy
> of course to supply a patch when the problem becomes more problematic.
I haven't seen this behavior, but I'd be happy to try to reproduce it. Do you have some suggestions for reproducing it?
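As a starting point for narrowing down a reproduction, here is a rough Python sketch that scans pbs_mom log lines for the "BUG: mismatched jobid" message and reports the two job IDs involved. It assumes each log entry sits on a single line (unwrapped, unlike the mail-wrapped excerpt above) and that the message text matches what the excerpt shows. The function name find_mismatches is made up for illustration and is not part of TORQUE.

```python
import re
import sys

# Matches the message as it appears in the excerpt above; the exact wording
# in other TORQUE versions is an assumption.
MISMATCH_RE = re.compile(
    r"BUG: mismatched jobid in.*?\((?P<left>\S+) != (?P<right>\S+)\)"
)


def find_mismatches(lines):
    """Return (timestamp, left_id, right_id) for each mismatch message."""
    hits = []
    for line in lines:
        match = MISMATCH_RE.search(line)
        if match is None:
            continue
        # The timestamp is the first semicolon-separated field of the entry.
        timestamp = line.split(";", 1)[0].strip()
        hits.append((timestamp, match.group("left"), match.group("right")))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_mismatches.py <path to a pbs_mom log file>
    with open(sys.argv[1]) as log_file:
        for timestamp, left_id, right_id in find_mismatches(log_file):
            print(f"{timestamp}: {left_id} != {right_id}")
```

Running this over the mom logs from each affected node would show which pairs of job IDs collide and when, which may help correlate the mismatch with jobs that start and exit within a second or two of each other.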
> -Joshua Bernstein
> Manager of Software Development
> Penguin Computing
David Beer | Senior Software Engineer