[torqueusers] jobs completing with processes still running
Steve Young
chemadm at hamilton.edu
Wed May 7 09:11:36 MDT 2008
Hi Mike,
If it were me, I would first try running the program by hand on a
node that isn't allocated to any job. You could even create a system
reservation so that no jobs get scheduled on that node while you test
the application. Running the program yourself takes PBS/Moab out of
the picture. Once you are certain the application works as expected,
start submitting it through the queue system again to see whether the
problem is introduced there. Hope this helps =).
-Steve
On May 6, 2008, at 11:49 AM, Michael Robbert wrote:
> I am a new Torque user, so be gentle. We are running Moab 5.2.1 and
> Torque 2.3.0. One of our users submits jobs (user-compiled CHARMM,
> if that matters) that often return within a few seconds, with only
> "Done." in the Output/Error file. The job disappears from the queue,
> as would be expected, but their code is still running on all cores
> of all nodes it was started on. When I run "mdiag -n", these nodes
> show up as idle, yet their load is high.
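One way to confirm the symptom described above (node reported idle, load
still high) is to look for busy processes owned by users who should not
have anything running on the node. The following is only a rough sketch:
the function names, the SYSTEM_USERS set, and the 50% CPU threshold are
assumptions made for illustration, not part of Torque or Moab.

```python
import subprocess

# Assumption: accounts treated as "system" on the node; adjust per site.
SYSTEM_USERS = {"root", "daemon", "ntp", "rpc", "nobody"}


def find_stray_processes(ps_output, active_users, cpu_threshold=50.0):
    """Return (user, pid, cpu, command) for busy processes whose owner is
    neither a system account nor a user with a known job on this node."""
    strays = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        user, pid, cpu, command = fields
        if user in SYSTEM_USERS or user in active_users:
            continue
        if float(cpu) >= cpu_threshold:
            strays.append((user, int(pid), float(cpu), command))
    return strays


def ps_snapshot():
    """Capture the local process table (user, pid, %cpu, command name)."""
    return subprocess.run(
        ["ps", "-eo", "user,pid,pcpu,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
```

On a compute node this might be called as
find_stray_processes(ps_snapshot(), active_users=set()) when the scheduler
reports the node as idle. How you build active_users is site-specific; the
owners of jobs that Torque still lists on the node would be one plausible
source.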
> Are there any ideas about what could cause this? Should I be
> looking at their code? We only have a few users, and so far
> theirs is the only code doing this. What commands in Moab or Torque
> should I be using to detect and solve this issue? So far I have only
> used mdiag and talked with the user. Specifically, how can I find
> the jobid of a completed job, given that I know which nodes it ran
> on? And then how can I peer into the guts of that job to find out
> what it was doing?
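For the question of mapping a node back to completed jobs, one option is
to scan the Torque server's accounting records, where job-end ("E")
entries carry an exec_host field. The sketch below assumes the usual
"timestamp;type;jobid;key=value ..." record layout; the function name and
the sample values in the usage note are made up for illustration.

```python
def jobs_on_node(accounting_lines, node):
    """Yield (timestamp, jobid, user) for ended jobs that ran on `node`."""
    for line in accounting_lines:
        parts = line.rstrip("\n").split(";", 3)
        if len(parts) != 4 or parts[1] != "E":  # "E" marks a job-end record
            continue
        timestamp, _, jobid, message = parts
        # Collect the key=value attributes; tokens without "=" are ignored.
        attrs = dict(
            item.split("=", 1) for item in message.split() if "=" in item
        )
        # exec_host looks like "node01/0+node01/1+node02/0"
        slots = attrs.get("exec_host", "").split("+")
        hosts = {slot.split("/")[0] for slot in slots}
        if node in hosts:
            yield timestamp, jobid, attrs.get("user", "?")
```

You would feed it the lines of a daily accounting file, for example
jobs_on_node(open(path), "node02"), where path points into the server's
server_priv/accounting directory (its location depends on how Torque was
installed). Once you have a jobid, running tracejob on it is probably the
quickest way to pull together what the server and MOM logs recorded.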
> I don't expect to get an answer to the problem, but hope I can find
> out how to research it. I have searched the list archives and have
> been trying to read as much documentation as I can, but I'm still
> stumped. I just signed up for MoabCon so hopefully I'll be an
> expert after that.
>
> Thanks for any suggestions,
> Mike Robbert
> Colorado School of Mines