[torqueusers] jobs completing with processes still running
Smith, Jerry Don II
jdsmit at sandia.gov
Wed May 7 09:21:18 MDT 2008
What job launch mechanism is the cluster using: mpirun or mpiexec? You want to make sure it is one whose processes the resource manager can track, so that it can clean them up when the job ends.
You can grep the files in $PBS_HOME/server_priv/accounting for the node name to get the job ID. There is also a setting in Moab and Torque to hold completed job info for X amount of time, so you can use checkjob and/or qstat -f to do further diagnosis. I am not at my desk, so I can't tell you the parameter offhand.
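As a rough illustration of the accounting-log lookup, here is a minimal Python sketch. It assumes the usual Torque accounting record layout (`date time;record type;job id;space-separated key=value pairs`) and an `exec_host` value of the form `node/cpu+node/cpu`. The function name `jobs_on_node` and the sample records (host names, job IDs, users) are made up for the example and are not from the cluster in question.

```python
import re


def jobs_on_node(lines, node):
    """Return job IDs (first-seen order) whose exec_host includes `node`.

    Assumes records look like:
        MM/DD/YYYY HH:MM:SS;TYPE;JOBID;key=value key=value ...
    Only start (S) and end (E) records are inspected.
    """
    found = []
    for line in lines:
        fields = line.rstrip("\n").split(";", 3)
        if len(fields) < 4:
            continue  # not a well-formed accounting record
        _timestamp, record_type, job_id, attrs = fields
        if record_type not in ("S", "E"):
            continue
        match = re.search(r"(?:^|\s)exec_host=(\S+)", attrs)
        if not match:
            continue
        # exec_host is assumed to be "node01/0+node01/1+node02/0"
        hosts = {slot.split("/", 1)[0] for slot in match.group(1).split("+")}
        if node in hosts and job_id not in found:
            found.append(job_id)
    return found


# Hypothetical sample records, for illustration only.
sample = [
    "05/06/2008 11:40:02;Q;101.head;queue=batch",
    "05/06/2008 11:40:05;S;101.head;user=alice exec_host=node01/0+node01/1+node02/0",
    "05/06/2008 11:40:09;E;101.head;user=alice exec_host=node01/0+node01/1+node02/0 Exit_status=0",
    "05/06/2008 11:45:00;E;102.head;user=bob exec_host=node03/0 Exit_status=0",
]

print(jobs_on_node(sample, "node02"))
```

To run it against real data, pass an open accounting file instead of `sample` (the files in that directory are normally named by date, YYYYMMDD). Matching on the parsed host name rather than a plain substring avoids `node1` also matching `node10`.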
----- Original Message -----
From: torqueusers-bounces at supercluster.org <torqueusers-bounces at supercluster.org>
To: Michael Robbert <mrobbert at mines.edu>
Cc: Torque Users <torqueusers at supercluster.org>
Sent: Wed May 07 09:11:36 2008
Subject: Re: [torqueusers] jobs completing with processes still running
If it were me, I would try running the program on one of the nodes
that isn't allocated. You could even create a system reservation so
that no jobs get scheduled on the node while you test the
application. By running the program yourself you eliminate pbs/moab
from the problem. Once you are certain the application works as
expected, I would start using the queue system again to see whether
the problem gets introduced there. Hope this helps =).
On May 6, 2008, at 11:49 AM, Michael Robbert wrote:
> I am a new Torque user so be gentle. We are running Moab 5.2.1 and
> Torque 2.3.0 and we have a user that is submitting jobs (user
> compiled CHARMM if that matters) and often their jobs are returning
> within a few seconds and the only data in their Output/Error file
> is "Done.". The job disappears from the queue as would be expected,
> but the problem is that their code is still running on all cores of
> all nodes that they started it on. I run "mdiag -n" and see these
> nodes show up as idle but load is HIGH.
> Are there any ideas of what could cause this to happen? Should I be
> looking at their code? We only have a few users so far, but so far
> theirs is the only code doing this. What commands in Moab or Torque
> should I be using to detect and solve this issue? So far it is just
> mdiag and communication with the user. Specifically how can I find
> out the jobid of a job that has completed given that I know what
> nodes it was running on? And then how can I peer into the guts of
> that job to find out what it was doing?
> I don't expect to get an answer to the problem, but hope I can find
> out how to research it. I have searched the list archives and have
> been trying to read as much documentation as I can, but I'm still
> stumped. I just signed up for MoabCon so hopefully I'll be an
> expert after that.
> Thanks for any suggestions,
> Mike Robbert
> Colorado School of Mines
> torqueusers mailing list
> torqueusers at supercluster.org