[torqueusers] resources problem
garrick at clusterresources.com
Wed Nov 29 12:06:27 MST 2006
On Wed, Nov 29, 2006 at 03:25:34PM +0100, Bernd Schubert alleged:
> we just had the problem that a job of one of our group members required more
> resources (memory) than requested, but torque still didn't kill it. Also,
> qstat reports far too low resource usage for this particular program. For all
> other programs presently running the resources reported are fine; only this
> program is troublesome.
> While looking at what's so special about it, we see it's basically an MPI program
> compiled with mpicc; however, it is NOT started using mpirun, but on our
> cluster it's just queued like any other program using 'qsub program_name'.
'qsub' can't submit binaries, so the batch script is probably running
> Some time ago Garrick sent me his "dumpmom" program to analyze
> reported resources. Now, using dumpmom, I clearly see that qstat reports the
> resources used by the starting bash, but doesn't count the resources used by
> the program started from this bash. However, dumpmom additionally
> reports those data for the program/pid started from this bash.
> Here I'm lost now; I have no idea what I could do or how to debug it. Is this
> a bug of mom_priv running on the nodes or is it a bug of pbs_server?
If your mpirun is using rsh/ssh to spawn the remote processes, then
they won't be tracked and added to the job's usage.
You want to use OSC's mpiexec instead of the vendor mpirun.
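
As a rough illustration of why rsh/ssh-spawned processes go missing from the job's usage, here is a toy model in Python. It is not TORQUE source code. It assumes the MOM sums usage over the processes that share the job's session ID, and that a process started through rsh/ssh ends up in the remote daemon's session instead. All PIDs, session IDs, and memory figures are made up.

```python
# Toy model only, not TORQUE code. Assumption: per-job usage is the sum
# over processes whose session ID matches the job's session.
from typing import List, NamedTuple


class Proc(NamedTuple):
    pid: int
    sid: int      # session ID the process belongs to
    rss_kb: int   # resident memory in kB (made-up numbers)
    cmd: str


def job_usage_kb(procs: List[Proc], job_sid: int) -> int:
    """Sum resident memory over the processes in the job's session."""
    return sum(p.rss_kb for p in procs if p.sid == job_sid)


procs = [
    Proc(100, 100, 2_000, "bash running the job script"),
    Proc(101, 100, 500_000, "program started directly by the script"),
    Proc(200, 200, 1_000, "rshd/sshd on the node (its own session)"),
    Proc(201, 200, 800_000, "MPI rank spawned through rsh/ssh"),
]

# Only the two processes in session 100 are counted; the rank started
# through rsh/ssh lives in session 200 and is invisible to the job.
print(job_usage_kb(procs, job_sid=100))
```

In this model the 800000 kB used by the rsh/ssh-spawned rank never shows up in the job's total. A launcher that starts the remote processes through the MOMs themselves keeps them attached to the job, so they get counted.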