[torqueusers] jobs completing with processes still running
Michael Robbert
mrobbert at mines.edu
Tue May 6 09:49:34 MDT 2008
I am a new Torque user, so be gentle. We are running Moab 5.2.1 and
Torque 2.3.0, and we have a user submitting jobs (user-compiled
CHARMM, if that matters) that often return within a few
seconds with only "Done." in their Output/Error file. The job
disappears from the queue, as would be expected, but the problem is that
their code is still running on all cores of all nodes that it was started
on. When I run "mdiag -n", these nodes show up as idle, but their load is
high.
Are there any ideas about what could cause this to happen? Should I be
looking at their code? We only have a few users so far, and
theirs is the only code doing this. What commands in Moab or Torque
should I be using to detect and solve this issue? So far I have only used
mdiag and talked with the user. Specifically, how can I find
the jobid of a job that has completed, given that I know which nodes it
was running on? And then how can I peer into the guts of that job to
find out what it was doing?
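
To make the jobid question concrete, here is a minimal sketch of one possible approach: scanning Torque server accounting records for ended jobs whose exec_host list names a given node. It assumes the records have the form "date;type;jobid;key=value ..." with "E" marking a job end and exec_host entries written as "node/cpu" joined by "+". The function name jobs_on_node, the node names, and the sample records are hypothetical and for illustration only. Check the record layout on the actual server before relying on it.

```python
import re


def jobs_on_node(lines, node):
    """Return IDs of ended jobs whose exec_host list includes `node`.

    Assumes accounting lines shaped like
    'MM/DD/YYYY HH:MM:SS;E;jobid;key=value key=value ...'.
    """
    job_ids = []
    for line in lines:
        parts = line.rstrip("\n").split(";", 3)
        if len(parts) != 4:
            continue  # not a well-formed record
        _timestamp, record_type, job_id, message = parts
        if record_type != "E":
            continue  # only look at job-end records
        match = re.search(r"exec_host=(\S+)", message)
        if not match:
            continue
        # exec_host is assumed to look like 'node01/0+node01/1+node02/0'
        hosts = {entry.split("/")[0] for entry in match.group(1).split("+")}
        if node in hosts and job_id not in job_ids:
            job_ids.append(job_id)
    return job_ids


# Hypothetical sample records, made up for this sketch.
SAMPLE = [
    "05/06/2008 09:40:01;S;101.head;user=alice exec_host=node01/0+node02/0",
    "05/06/2008 09:40:07;E;101.head;user=alice exec_host=node01/0+node02/0",
    "05/06/2008 09:45:12;E;102.head;user=bob exec_host=node03/0+node03/1",
]

if __name__ == "__main__":
    print(jobs_on_node(SAMPLE, "node02"))
```

In practice the lines would come from the server's accounting files rather than an in-memory list, and the resulting IDs could then be looked up with Torque's tracejob utility.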
I don't expect to get an answer to the problem, but I hope to find out
how to research it. I have searched the list archives and have been
trying to read as much documentation as I can, but I'm still stumped. I
just signed up for MoabCon so hopefully I'll be an expert after that.
Thanks for any suggestions,
Mike Robbert
Colorado School of Mines