[torqueusers] resources problem

Bernd Schubert bernd-schubert at gmx.de
Wed Nov 29 07:25:34 MST 2006


We just had the problem that a job belonging to one of our group members used 
more resources (memory) than requested, yet torque didn't kill it. Also, 
qstat reports far too low resource usage for this particular program. For all 
other programs presently running, the reported resources are fine; only this 
program is troublesome.
While looking at what's so special about it, we see it's basically an MPI 
program compiled with mpicc. However, it is NOT started using mpirun; on our 
cluster it's just queued like any other program using 'qsub program_name'.

Some time ago Garrick sent me his "dumpmom" program to analyze reported 
resources. Using dumpmom, I clearly see that qstat reports the resources used 
by the starting bash, but doesn't count the resources used by the program 
started from this bash. However, dumpmom additionally reports the data for 
the program/pid started from this bash.
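
To make the discrepancy concrete, here is a minimal Python sketch that 
compares the memory of the job's shell alone with the memory of the shell 
plus everything started from it. The pids and memory figures are made up for 
illustration, and the function name tree_rss_kb is hypothetical. It is only 
a way to cross-check the numbers by hand, not a description of how torque 
does its accounting.

```python
from collections import defaultdict


def tree_rss_kb(procs, root_pid):
    """Sum resident memory (kB) of root_pid and all of its descendants.

    procs maps pid -> (ppid, rss_kb). The table is hypothetical sample data.
    """
    # Build a parent -> children index from the flat process table.
    children = defaultdict(list)
    for pid, (ppid, _rss) in procs.items():
        children[ppid].append(pid)

    total = 0
    stack = [root_pid]
    seen = set()
    while stack:
        pid = stack.pop()
        if pid in seen or pid not in procs:
            continue
        seen.add(pid)
        total += procs[pid][1]
        stack.extend(children[pid])
    return total


# Hypothetical snapshot: a job shell (pid 100) that started one program (pid 101).
sample = {
    100: (1, 1500),       # bash started for the job
    101: (100, 2000000),  # program started from that bash
    200: (1, 300),        # unrelated process, must not be counted
}

shell_only = sample[100][1]
whole_tree = tree_rss_kb(sample, 100)
print(shell_only, whole_tree)
```

In this made-up snapshot the shell alone accounts for 1500 kB while the whole 
tree accounts for 2001500 kB, which is the kind of gap we see between qstat 
and the per-pid data.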

Here I'm lost; I have no idea what I could do or how to debug it. Is this 
a bug in pbs_mom running on the nodes, or is it a bug in pbs_server?

Thanks in advance,

Bernd Schubert
PCI / Theoretische Chemie
Universität Heidelberg
INF 229
69120 Heidelberg

