[torqueusers] "No such process (3) in resi_sum, ###: get_proc_stat"

Kamil Kisiel kamil at zymeworks.com
Tue Jun 24 12:27:25 MDT 2008


On 23-Jun-08, at 16:17, Glen Beane wrote:
>
>
> On Mon, Jun 23, 2008 at 2:57 PM, Kamil Kisiel <kamil at zymeworks.com>  
> wrote:
> On 9-Jun-08, at 14:02, Kamil Kisiel wrote:
>
>> Occasionally some of our cluster nodes send out a syslog message  
>> such as:
>>
>> node071.cluster.zymeworks.com pbs_mom: No such process (3) in  
>> resi_sum, 797: get_proc_stat
>>
>> The number after "resi_sum" is different in each message;
>> presumably it's the PID of some process.
>>
>> What does this mean, and what could be causing it?
>
> So far I haven't had any reply to this. Nobody has any clue?
>
> How often do you see this?  I haven't had a chance to look at this
> in detail, but what could be happening is that the process with that
> PID is dying and resi_sum is being called before pbs_mom picks up
> the exiting process.  If it happens often, then please provide me
> with as much information as you can (especially the TORQUE version).

It happens fairly often; I receive a few log messages per day. I
haven't yet been able to determine which part of a job, or which
types of jobs, trigger it. We're running TORQUE 2.1.6.

I also get a similar message for cput_sum.
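
For reference, error 3 on Linux is ESRCH, "No such process", which
fits Glen's theory: the PID was valid when the scan of the job's
processes started, but the process was gone by the time its stats were
read. Below is a minimal sketch of that race in C. It is an
illustration only, not TORQUE's actual get_proc_stat; the function
name, the message format, and the choice of reading /proc/<pid>/stat
are assumptions modeled on the log line quoted above.

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>

    /* Illustration of the race: a PID collected from a /proc scan can
     * exit before its stat file is opened. Depending on how the stats
     * are fetched, the failure shows up as ENOENT (reading /proc) or
     * ESRCH (syscalls that take a PID directly). */
    static int read_proc_stat(pid_t pid)
    {
        char path[64];
        FILE *fp;

        snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);

        fp = fopen(path, "r");
        if (fp == NULL)
        {
            /* The process exited between the scan and this read; log
             * a message in the style of the one pbs_mom emits. */
            fprintf(stderr, "%s (%d) in resi_sum, %d: get_proc_stat\n",
                    strerror(errno), errno, (int)pid);
            return -1;
        }

        /* ... parse the utime/stime/rss fields here ... */

        fclose(fp);
        return 0;
    }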

>
> I also noticed that jobs run through MPI under-report the cputime
> used in qstat output. Is that related, or a separate issue?
>
> Which MPI do you use, and which job launcher do you use?  If the
> job launcher is not using TM (the task manager API provided by
> TORQUE and OpenPBS/PBS Pro) to spawn all of the remote processes,
> then the CPU time will be under-reported, because those processes
> are outside the control of TORQUE.  If you let us know which MPI
> you use and which job launcher you use (mpiexec/mpirun), we can
> know for sure whether this is what is going on.  In addition to the
> under-reporting of CPU time, using a non-TM launcher can also lead
> to processes that aren't always cleaned up when a job crashes or is
> killed prematurely.
>
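
To make the TM distinction concrete, here is a minimal sketch of a
launcher spawning one task through the TM API. It assumes TORQUE's
tm.h header and libtorque are available; error handling is mostly
elided, and this is not how any real MPI launcher is implemented, just
an outline of the mechanism Glen describes: pbs_mom forks the task
itself, so the process stays inside TORQUE's accounting and cleanup.

    #include <stdio.h>
    #include <tm.h>   /* TORQUE's task manager (TM) API */

    extern char **environ;

    /* Hypothetical launcher: spawn argv[1..] on the first node of
     * the current job via pbs_mom. */
    int main(int argc, char **argv)
    {
        struct tm_roots  roots;
        tm_node_id      *nodes;
        int              nnodes;
        tm_task_id       tid;
        tm_event_t       spawn_event, polled;
        int              tm_errno;

        if (argc < 2 || tm_init(NULL, &roots) != TM_SUCCESS)
        {
            fprintf(stderr, "usage: tmspawn cmd [args] (inside a job)\n");
            return 1;
        }

        /* Nodes allocated to this job, as pbs_mom knows them. */
        if (tm_nodeinfo(&nodes, &nnodes) != TM_SUCCESS)
            return 1;

        /* pbs_mom on nodes[0] forks and execs the command for us, so
         * the child is counted in the cput/mem sums and is killed
         * with the job. A non-TM launcher (e.g. plain rsh/ssh) would
         * put the process outside this bookkeeping. */
        if (tm_spawn(argc - 1, argv + 1, environ, nodes[0],
                     &tid, &spawn_event) != TM_SUCCESS)
            return 1;

        /* Block until the spawn event is acknowledged. */
        tm_poll(TM_NULL_EVENT, &polled, 1, &tm_errno);

        tm_finis();
        return 0;
    }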

We're using OpenMPI 1.2.6 built with TM support. We launch with
mpirun, but as far as I am aware, mpirun and mpiexec are equivalent
in OpenMPI.



