[torqueusers] Mismatched job_id in pbs_mom

Joshua Bernstein jbernstein at penguincomputing.com
Wed Jun 30 15:33:34 MDT 2010


As a quick follow-up to my post: while the error message in TORQUE would
suggest a bug, the problem turned out to be in another piece of
software. In addition to the logs I first showed, we also saw these
errors in /var/log/messages:

Jun 28 12:42:10 n1 Jun 28 12:42:10 pbs_mom: 
LOG_ERROR::Resource temporarily unavailable (11) in fork_me, fork failed
Jun 28 13:21:09 n4 Jun 28 13:21:09 pbs_mom: 
LOG_ERROR::Resource temporarily unavailable (11) in fork_me, fork failed
Jun 28 13:21:09 n4 Jun 28 13:21:09 pbs_mom: 
LOG_ERROR::fork_to_user, forked failed, errno=25 (Inappropriate ioctl 
for device)
Jun 28 13:21:09 n4 Jun 28 13:21:09 pbs_mom: 
LOG_ERROR::Inappropriate ioctl for device (25) in req_cpyfile, 
fork_to_user failed with rc=-15010 'forked failed, errno=25 
(Inappropriate ioctl for device)' - returning failure
Jun 28 13:21:10 n0 Jun 28 13:21:10 pbs_mom: 
LOG_ERROR::Resource temporarily unavailable (11) in fork_me, fork failed

It turned out to be an issue with a new signal being passed up from
copy_process() inside the kernel: the kernel was refusing to complete
the fork while a signal was pending for the parent. A quick patch to
some of our code cleaned up the problem.

-Joshua Bernstein
Software Development Manager
Penguin Computing

Ken Nielson wrote:
> On 06/23/2010 05:26 PM, Joshua Bernstein wrote:
>> David Beer wrote:
>>> ----- Original Message -----
>>>> Hello Folks!
>>>> When testing this specific workload within TORQUE we are running into
>>>> an
>>>> issue where TORQUE (qstat) thinks there is still a job in the
>>>> running
>>>> state (R), though logging into that node and running ps -ef confirms
>>>> that no job is actually running there. Further, the application's
>>>> logfile confirms the job in question exited cleanly.
>>>> If we have a peek inside of the pbs_mom's log for that node, pbs_mom
>>>> prints out a nice message saying something about a bug here. Notice
>>>> the
>>>> BUG: line in the excerpt below:
>>>> 20100621:06/21/2010 17:46:48;0001; pbs_mom;Job;TMomFinalizeJob3;job
>>>> 994.scyld.localdomain started, pid = 73977
>>>> 20100621:06/21/2010 17:46:49;0080;
>>>> pbs_mom;Job;994.scyld.localdomain;scan_for_terminated: job
>>>> 994.scyld.localdomain task 1 terminated, sid=73977
>>>> 20100621:06/21/2010 17:46:50;0008;
>>>> pbs_mom;Job;994.scyld.localdomain;job was terminated
>>>> 20100621:06/21/2010 17:46:50;0001;
>>>> pbs_mom;Job;951.scyld.localdomain;BUG: mismatched jobid in
>>>> preobit_reply
>>>> (994.scyld.localdomain != 951.scyld.localdomain)
>>>> This particular job isn't doing anything MPI related, and is single
>>>> threaded. I've been able to duplicate this behavior in TORQUE 2.3.10,
>>>> and even in the SVN checkout of the 'trunk' from yesterday.
>>>> Has anybody seen this, or have any idea what's going on here? I'm happy,
>>>> of course, to supply a patch should the problem become more problematic.
>>> I haven't seen this behavior, but I'd be happy to try to reproduce it. Do you have some suggestions for reproducing it?
>> Thanks David. I'd love to say I had an easy way to reproduce it, but it
>> seems to involve thousands of jobs running concurrently on the system
>> at the same time. I should add that the scheduler in this case is Maui
>> rather than pbs_sched or Moab.
>> Currently, I'm thinking there is a context-switch problem inside of
>> catch_child.c, but it's just an early idea.
>> -Josh
> With over a thousand jobs running I would not be surprised if we were 
> caught by a race condition. TORQUE global variables are completely 
> unprotected and I am surprised we do not run into more problems.
> Ken
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
