[torqueusers] Mismatched job_id in pbs_mom

Joshua Bernstein jbernstein at penguincomputing.com
Wed Jun 23 17:26:09 MDT 2010



David Beer wrote:
> 
> ----- Original Message -----
>> Hello Folks!
>>
>> When testing this specific workload within TORQUE we are running into
>> an issue where TORQUE (qstat) thinks there is still a job in the
>> running state (R), though logging into that node and running ps -ef
>> confirms that no job is actually running there. Further still, the
>> application's logfile confirms the job in question exited cleanly.
>>
>> If we have a peek inside the pbs_mom log for that node, pbs_mom
>> prints a message reporting a bug. Notice the BUG: line in the
>> excerpt below:
>>
>>
>> 20100621:06/21/2010 17:46:48;0001; pbs_mom;Job;TMomFinalizeJob3;job 994.scyld.localdomain started, pid = 73977
>> 20100621:06/21/2010 17:46:49;0080; pbs_mom;Job;994.scyld.localdomain;scan_for_terminated: job 994.scyld.localdomain task 1 terminated, sid=73977
>> 20100621:06/21/2010 17:46:50;0008; pbs_mom;Job;994.scyld.localdomain;job was terminated
>> 20100621:06/21/2010 17:46:50;0001; pbs_mom;Job;951.scyld.localdomain;BUG: mismatched jobid in preobit_reply (994.scyld.localdomain != 951.scyld.localdomain)
>>
>> This particular job isn't doing anything MPI related, and is single
>> threaded. I've been able to duplicate this behavior in TORQUE 2.3.10,
>> and even in the SVN checkout of the 'trunk' from yesterday.
>>
>> Has anybody seen this, or have any idea what's going on here? I'm happy
>> of course to supply a patch if the problem becomes more pressing.
> 
> I haven't seen this behavior, but I'd be happy to try to reproduce it. Do you have some suggestions for reproducing it?

Thanks David. I'd love to say I had an easy way to reproduce it, but it 
seems to involve thousands of jobs running concurrently on the system. 
I should add that the scheduler in this case is Maui rather than 
pbs_sched or Moab.

Currently, I'm thinking there is a context switch problem inside of 
catch_child.c, but it's just an early idea.
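
For anyone who wants to check their own mom logs for the same message, 
here is a minimal sketch that pulls out the two job ids from each 
"mismatched jobid" line. It assumes each log entry sits on a single line 
and that the message text matches the excerpt quoted above; the function 
name find_mismatches is just made up for illustration, and I have not 
checked the wording against other TORQUE versions.

```python
import re

# Pattern taken from the BUG line in the quoted excerpt. Whether other
# TORQUE versions word the message identically is an assumption.
MISMATCH_RE = re.compile(
    r"BUG: mismatched jobid in preobit_reply\s*\((\S+) != (\S+)\)"
)


def find_mismatches(lines):
    """Yield (timestamp, first_id, second_id) for each mismatch line.

    The timestamp is whatever precedes the first ';' on the line.
    The ids are returned in the order they appear in the message.
    """
    for line in lines:
        match = MISMATCH_RE.search(line)
        if match:
            timestamp = line.split(";", 1)[0]
            yield timestamp, match.group(1), match.group(2)


# Sample entries modeled on the excerpt, joined back onto one line each.
sample = [
    "06/21/2010 17:46:50;0008; pbs_mom;Job;994.scyld.localdomain;"
    "job was terminated",
    "06/21/2010 17:46:50;0001; pbs_mom;Job;951.scyld.localdomain;"
    "BUG: mismatched jobid in preobit_reply "
    "(994.scyld.localdomain != 951.scyld.localdomain)",
]

for timestamp, first_id, second_id in find_mismatches(sample):
    print(timestamp, first_id, second_id)
```

To run it against a real log, pass an open file object instead of the 
sample list. Collecting the pairs of ids may help show whether the stuck 
jobs are always the ones named in these messages.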

-Josh

