[torqueusers] Mismatched job_id in pbs_mom
Joshua Bernstein
jbernstein at penguincomputing.com
Wed Jun 23 17:26:09 MDT 2010
David Beer wrote:
>
> ----- Original Message -----
>> Hello Folks!
>>
>> When testing this specific workload within TORQUE we are running into
>> an issue where TORQUE (qstat) thinks there is still a job in the
>> running state (R), though logging into that node and running ps -ef
>> confirms that no job is actually running there. Further still, the
>> application's logfile confirms the job in question exited cleanly.
>>
>> If we have a peek inside the pbs_mom log for that node, pbs_mom
>> prints out a message reporting a bug. Notice the BUG: line in the
>> excerpt below:
>>
>>
>> 20100621:06/21/2010 17:46:48;0001; pbs_mom;Job;TMomFinalizeJob3;job 994.scyld.localdomain started, pid = 73977
>> 20100621:06/21/2010 17:46:49;0080; pbs_mom;Job;994.scyld.localdomain;scan_for_terminated: job 994.scyld.localdomain task 1 terminated, sid=73977
>> 20100621:06/21/2010 17:46:50;0008; pbs_mom;Job;994.scyld.localdomain;job was terminated
>> 20100621:06/21/2010 17:46:50;0001; pbs_mom;Job;951.scyld.localdomain;BUG: mismatched jobid in preobit_reply (994.scyld.localdomain != 951.scyld.localdomain)
>>
>> This particular job isn't doing anything MPI related, and is single
>> threaded. I've been able to duplicate this behavior in TORQUE 2.3.10,
>> and even in the SVN checkout of the 'trunk' from yesterday.
>>
>> Has anybody seen this, or have any idea what's going on here? I'm happy
>> of course to supply a patch when the problem becomes more problematic.
>
> I haven't seen this behavior, but I'd be happy to try to reproduce it. Do you have some suggestions for reproducing it?
Thanks David. I'd love to say I had an easy way to reproduce it, but it
seems to involve 1,000s of jobs running on the system concurrently. I
should add that the scheduler in this case is Maui rather than pbs_sched
or Moab.
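
For anyone trying to catch this on their own cluster, a rough sketch of a log scan follows. It is not part of TORQUE. The regular expression is modelled only on the single BUG line quoted above, the function name find_mismatches is made up for illustration, and the log files to scan (typically under mom_logs/ in the TORQUE spool directory, an assumption about the install) are passed on the command line.

```python
import re
import sys

# Pattern modelled on the one BUG line quoted in this thread (an assumption:
# other TORQUE versions may word the message differently).
MISMATCH = re.compile(
    r"pbs_mom;Job;(?P<logged>[^;]+);BUG: mismatched jobid in\s+preobit_reply\s+"
    r"\((?P<left>\S+) != (?P<right>\S+)\)"
)


def find_mismatches(text):
    """Return (logged_job, left_id, right_id) for each mismatch line in text."""
    return [
        (m.group("logged"), m.group("left"), m.group("right"))
        for m in MISMATCH.finditer(text)
    ]


if __name__ == "__main__":
    # Usage: python scan_mom_log.py <mom log file> [<mom log file> ...]
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            for logged, left, right in find_mismatches(fh.read()):
                print(f"{path}: logged under {logged}: {left} != {right}")
```

Running this across the mom logs of every node after a large batch should show which job IDs are being confused with one another, which may help narrow down when the mismatch occurs.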
Currently, I'm thinking there is a context switch problem inside of
catch_child.c, but it's just an early idea.
-Josh