[torqueusers] Mismatched job_id in pbs_mom
knielson at adaptivecomputing.com
Thu Jun 24 09:31:50 MDT 2010
On 06/23/2010 05:26 PM, Joshua Bernstein wrote:
> David Beer wrote:
>> ----- Original Message -----
>>> Hello Folks!
>>> When testing this specific workload within TORQUE we are running into
>>> an issue where TORQUE (qstat) thinks there is still a job in the
>>> running (R) state. Logging into that node and running ps -ef, however,
>>> confirms that no such job is running there. Further, the application's
>>> logfile confirms the job in question exited cleanly.
>>> If we have a peek inside of the pbs_mom's log for that node, pbs_mom
>>> prints out a nice message saying something about a bug here. Notice
>>> BUG: line in the excerpt below:
>>> 20100621:06/21/2010 17:46:48;0001; pbs_mom;Job;TMomFinalizeJob3;job
>>> 994.scyld.localdomain started, pid = 73977
>>> 20100621:06/21/2010 17:46:49;0080;
>>> pbs_mom;Job;994.scyld.localdomain;scan_for_terminated: job
>>> 994.scyld.localdomain task 1 terminated, sid=73977
>>> 20100621:06/21/2010 17:46:50;0008;
>>> pbs_mom;Job;994.scyld.localdomain;job was terminated
>>> 20100621:06/21/2010 17:46:50;0001;
>>> pbs_mom;Job;951.scyld.localdomain;BUG: mismatched jobid in
>>> (994.scyld.localdomain != 951.scyld.localdomain)
>>> This particular job isn't doing anything MPI related, and is single
>>> threaded. I've been able to duplicate this behavior in TORQUE 2.3.10,
>>> and even in the SVN checkout of the 'trunk' from yesterday.
>>> Has anybody seen this, or have any idea what's going on here? I'm happy
>>> of course to supply a patch if the problem becomes more problematic.
>> I haven't seen this behavior, but I'd be happy to try to reproduce it. Do you have some suggestions for reproducing it?
> Thanks David. I'd love to say I had an easy way to reproduce it, but it
> seems to involve thousands of jobs running concurrently on the system at
> the same time. I should add that the scheduler in this case is Maui
> rather than pbs_sched or Moab.
> Currently, I'm thinking there is a context-switch problem inside
> catch_child.c, but it's just an early idea.
With over a thousand jobs running, I would not be surprised if we were
caught by a race condition. TORQUE global variables are completely
unprotected and I am surprised we do not run into more problems.