[torqueusers] jobs completing with processes still running

Michael Robbert mrobbert at mines.edu
Tue May 6 09:49:34 MDT 2008


I am a new Torque user, so be gentle. We are running Moab 5.2.1 and 
Torque 2.3.0, and we have a user who is submitting jobs (user-compiled 
CHARMM, if that matters). Often their jobs return within a few seconds, 
and the only data in their Output/Error file is "Done.". The job 
disappears from the queue as would be expected, but the problem is that 
their code is still running on all cores of all nodes that it was 
started on. When I run "mdiag -n", these nodes show up as idle, but 
their load is high.
Are there any ideas about what could cause this to happen? Should I be 
looking at their code? We only have a few users, but so far theirs is 
the only code doing this. What commands in Moab or Torque should I be 
using to detect and solve this issue? So far I have relied on mdiag and 
communication with the user. Specifically, how can I find out the jobid 
of a job that has completed, given that I know which nodes it was 
running on? And then how can I peer into the guts of that job to find 
out what it was doing?
I don't expect to get an answer to the problem, but I hope I can find 
out how to research it. I have searched the list archives and have been 
trying to read as much documentation as I can, but I'm still stumped. I 
just signed up for MoabCon, so hopefully I'll be an expert after that.

Thanks for any suggestions,
Mike Robbert
Colorado School of Mines
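
One possible way to approach the "which jobid ran on this node" question 
above is to scan the Torque server's accounting records. The sketch below 
is only an illustration under stated assumptions: it assumes records of 
the form "timestamp;type;jobid;key=value ..." with an exec_host field 
such as "node01/0+node02/0", and that end-of-job records have type "E". 
The function name, job ids, user names, and node names are hypothetical; 
on a real system the lines would be read from the server's accounting 
files rather than from the sample list.

```python
import re


def jobs_on_node(lines, node):
    """Return job ids from end ('E') records whose exec_host mentions node.

    Assumes each record looks like: timestamp;type;jobid;key=value ...
    """
    found = []
    for line in lines:
        parts = line.rstrip("\n").split(";", 3)
        if len(parts) != 4:
            continue  # not a record in the assumed format
        _timestamp, record_type, job_id, message = parts
        if record_type != "E":
            continue  # only look at end-of-job records
        match = re.search(r"(?:^|\s)exec_host=(\S+)", message)
        if not match:
            continue
        # exec_host entries are assumed to be "host/slot" joined by "+"
        hosts = {entry.split("/", 1)[0] for entry in match.group(1).split("+")}
        if node in hosts and job_id not in found:
            found.append(job_id)
    return found


# Hypothetical sample records, made up for illustration only.
sample = [
    "05/06/2008 09:40:01;S;101.head;user=alice exec_host=node01/0+node02/0",
    "05/06/2008 09:40:07;E;101.head;user=alice exec_host=node01/0+node02/0 Exit_status=0",
    "05/06/2008 09:41:00;E;102.head;user=bob exec_host=node03/0 Exit_status=0",
]

print(jobs_on_node(sample, "node02"))
```

With a jobid in hand, the matching record's other fields (user, start 
and end times, exit status) give a starting point for comparing against 
the processes still running on the node.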

