[torqueusers] "No such process (3) in resi_sum,
kamil at zymeworks.com
Mon Jun 30 14:42:57 MDT 2008
On 24-Jun-08, at 11:27 , Kamil Kisiel wrote:
> On 23-Jun-08, at 16:17 , Glen Beane wrote:
>> On Mon, Jun 23, 2008 at 2:57 PM, Kamil Kisiel <kamil at zymeworks.com> wrote:
>> On 9-Jun-08, at 14:02 , Kamil Kisiel wrote:
>>> Occasionally some of our cluster nodes send out a syslog message
>>> such as:
>>> node071.cluster.zymeworks.com pbs_mom: No such process (3) in
>>> resi_sum, 797: get_proc_stat
>>> The number after "resi_sum" is different in each message;
>>> presumably it's the PID of some process.
>>> What does this mean, and what could be causing it?
>> So far I haven't had any reply to this. Nobody has any clue?
>> How often do you see this? I haven't had a chance to look at this
>> in detail, but what could be happening is the process with that PID
>> is dying and resi_sum is being called before pbs_mom picks up the
>> exiting process. If it happens often, then please provide me with
>> as much information as you can (especially the TORQUE version).
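The race suggested above can be illustrated with a small sketch. It assumes that resi_sum works by reading /proc/<pid>/stat for each PID it believes belongs to a job (the "get_proc_stat" in the log message hints at this, but the code below is not TORQUE's implementation). The names read_stat, rss_pages, resident_sum and the proc_root parameter are hypothetical and exist only for illustration.

```python
import os


def read_stat(pid, proc_root="/proc"):
    """Return the text of <proc_root>/<pid>/stat, or None if the process is gone."""
    try:
        with open(os.path.join(proc_root, str(pid), "stat")) as f:
            return f.read()
    except (FileNotFoundError, ProcessLookupError):
        # The process exited between the time its PID was recorded and
        # the time we tried to read its stat file.
        return None


def rss_pages(stat_text):
    """Extract the resident set size (field 24, in pages) from a stat line."""
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so split only the text after the last ')'.
    rest = stat_text[stat_text.rindex(")") + 2:].split()
    # rest[0] is field 3 (state), so field n is rest[n - 3].
    return int(rest[21])


def resident_sum(pids, proc_root="/proc"):
    """Sum resident pages over pids, collecting any that have vanished."""
    total = 0
    vanished = []
    for pid in pids:
        text = read_stat(pid, proc_root)
        if text is None:
            vanished.append(pid)  # comparable to the "No such process" case
            continue
        total += rss_pages(text)
    return total, vanished
```

If this is the cause, a vanished PID is harmless in itself: the process simply exited between two polling steps, and the only visible effect is the log message.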
> It happens fairly often; I am receiving a few log messages per day.
> I haven't yet been able to determine at which portion of a job or
> which types of jobs cause it. We're using Torque 2.1.6.
> I also get a similar message for cput_sum.
>> I also noticed that jobs run through MPI are under-reporting the
>> cputime used in qstat output. Is that related, or a separate issue?
>> Which MPI do you use, and which job launcher do you use? If the
>> job launcher you use is not using TM (the task manager API provided
>> by TORQUE, OpenPBS/PBS Pro) to spawn all of the remote processes
>> then the cpu time will be under reported (these processes will be
>> outside the control of TORQUE). If you let us know what MPI you
>> use and what job launcher you use (mpiexec/mpirun) we can know for
>> sure if this is what is going on. In addition to the under-reporting
>> of cpu time, using a non-TM launcher can also lead to processes
>> that aren't always cleaned up when a job crashes or is killed.
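The under-reporting described above can be sketched with a toy model. It assumes that CPU time is summed only over processes whose session ID belongs to a task started through TM, so anything started another way (for example over rsh or ssh) lands in a session the batch system does not know about. The Proc class, job_cpu_time function and all numbers are hypothetical, not taken from TORQUE.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    pid: int
    session: int      # session ID the process belongs to
    cpu_seconds: int  # CPU time the process has consumed


def job_cpu_time(procs, job_sessions):
    """Sum CPU time over processes whose session is tracked for the job."""
    return sum(p.cpu_seconds for p in procs if p.session in job_sessions)


# Two MPI ranks spawned through TM (sessions 500 and 600, known to the
# batch system) and one rank started by a non-TM launcher (session 900).
procs = [
    Proc(pid=501, session=500, cpu_seconds=120),
    Proc(pid=601, session=600, cpu_seconds=118),
    Proc(pid=901, session=900, cpu_seconds=121),
]

tracked = job_cpu_time(procs, job_sessions={500, 600})
actual = sum(p.cpu_seconds for p in procs)
print(tracked, actual)
```

In this model the untracked rank's CPU time never enters the job total, and the same process would also be missed when the job's sessions are killed.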
> We're using OpenMPI 1.2.6 built with TM support. We launch with
> mpirun but as far as I am aware mpirun and mpiexec are equivalent in
> Notice of Confidentiality: The information transmitted is intended
> only for the person or entity to which it is addressed and may
> contain confidential and/or privileged material. Any review, re-
> transmission, dissemination or other use of or taking of any action
> in reliance upon this information by persons or entities other than
> the intended recipient is prohibited. If you received this in error
> please contact the sender immediately by return electronic
> transmission and then immediately delete this transmission including
> all attachments without copying, distributing or disclosing the same.
We're still seeing this numerous times every single day. Any help at
all would be appreciated.
HPC Systems Engineer, Zymeworks Inc.
201-1401 West Broadway,
Vancouver, BC, V6H 1H6, Canada
Tel: (604) 678-1388 ext. 135
Fax: (604) 737-7077