[torqueusers] "No such process (3) in resi_sum, ###: get_proc_stat"

Glen Beane glen.beane at gmail.com
Mon Jun 23 17:17:23 MDT 2008


On Mon, Jun 23, 2008 at 2:57 PM, Kamil Kisiel <kamil at zymeworks.com> wrote:

>   On 9-Jun-08, at 14:02 , Kamil Kisiel wrote:
>
>  Occasionally some of our cluster nodes send out a syslog message such as:
>
> node071.cluster.zymeworks.com pbs_mom: No such process (3) in resi_sum,
> 797: get_proc_stat
>
> The number after "resi_sum" is different in each message; presumably it's
> the PID of some process.
>
> What does this mean, and what could be causing it?
>
>
> So far I haven't had any reply to this. Nobody has any clue?
>

How often do you see this?  I haven't had a chance to look at this in
detail, but what could be happening is that the process with that PID is
dying and resi_sum is being called before pbs_mom picks up the exiting
process.  If it happens often, then please provide me with as much
information as you can (especially the TORQUE version).
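
To illustrate the kind of race I mean, here is a minimal sketch in Python (not the actual pbs_mom code, which is C). It assumes a Linux-style /proc layout; the function name read_cpu_ticks and the proc_root parameter are made up for the example. A process can show up in a scan of /proc and then exit before its stat file is read, at which point the read fails with "No such file" or "No such process".

```python
import os

def read_cpu_ticks(pid, proc_root="/proc"):
    """Return (utime, stime) in clock ticks for pid, or None if it has exited.

    Hypothetical helper for illustration; proc_root is a parameter only so
    the function can be pointed at a fake directory tree.
    """
    path = os.path.join(proc_root, str(pid), "stat")
    try:
        with open(path) as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        # The process went away between being listed and being read.
        # A monitor that treats this as an error would log something
        # like "No such process"; here we just report that it is gone.
        return None
    # The command name is wrapped in parentheses and may contain spaces,
    # so split on the last ')' rather than on whitespace alone.
    fields = data[data.rindex(")") + 1:].split()
    # fields[0] is the state (field 3 in proc(5)), so utime and stime
    # (fields 14 and 15) are at indexes 11 and 12.
    return int(fields[11]), int(fields[12])
```

A caller that scans all PIDs and then calls this for each one has to expect None for any process that exited in between, and skip it instead of treating it as a failure.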


>
>
> I also noticed that jobs run through MPI are under-reporting the cputime
> used in qstat output. Is that related, or a separate issue?
>

Which MPI do you use, and which job launcher do you use?  If the job
launcher you use is not using TM (the task manager API provided by TORQUE,
OpenPBS/PBS Pro) to spawn all of the remote processes, then the CPU time
will be under-reported (these processes will be outside the control of
TORQUE).  If you let us know which MPI and which job launcher
(mpiexec/mpirun) you use, we can know for sure whether this is what is
going on.  In addition to the under-reporting of CPU time, using a non-TM
launcher can also lead to processes that aren't always cleaned up when a
job crashes or is killed prematurely.
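
As a rough picture of why the numbers come out low, here is a toy accounting sketch in Python. It assumes the batch system attributes processes to a job by session ID, and that ranks started through rsh/ssh end up in a session the job does not own. The Proc class, reported_cputime, and all of the numbers are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Proc:
    pid: int
    session: int        # session ID the process belongs to
    cpu_seconds: float  # CPU time consumed so far

def reported_cputime(procs, job_sessions):
    """Sum CPU time only over processes in sessions the job is known to own."""
    return sum(p.cpu_seconds for p in procs if p.session in job_sessions)

# Made-up numbers, for illustration only.
procs = [
    Proc(pid=100, session=100, cpu_seconds=5.0),    # job script
    Proc(pid=101, session=100, cpu_seconds=120.0),  # rank spawned through TM
    Proc(pid=202, session=202, cpu_seconds=118.0),  # rank started via rsh/ssh
]
job_sessions = {100}

print(reported_cputime(procs, job_sessions))  # what the job would be charged
print(sum(p.cpu_seconds for p in procs))      # what was actually consumed
```

The rank in session 202 does real work but never shows up in the job's total, and for the same reason nothing knows to kill it when the job is deleted.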