[torqueusers] resources_used.mem problems
sm4082 at nyu.edu
Thu Oct 18 07:58:41 MDT 2012
We have Torque 2.5.12 on one of our new clusters. The OS is Red Hat Enterprise Linux Server release 6.2 (Santiago). We installed OpenMPI version 1.4.5 (compiled with the Intel compilers).
Strangely, our parallel jobs that use OpenMPI 1.4.5 report resources_used.mem as the sum of the memory used on all the nodes in the job, instead of reporting only the memory used on the mother superior node (rank=0). If we run the same job with MVAPICH2, resources_used.mem shows only the value from the node with rank=0. On our old clusters, with OpenMPI 1.4.3 and Torque 2.5.11, we also see only the value from the mother superior node (rank=0).
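For anyone who wants to compare the reported value between the two MPI stacks, here is a minimal Python sketch that pulls resources_used.mem out of `qstat -f` style text. The sample text, the job id, and the function name are hypothetical and only for illustration; the sketch assumes the attribute appears as `resources_used.mem = <number><unit>`.

```python
import re

# Hypothetical excerpt in the style of `qstat -f <jobid>` output.
# The job id and all values here are made up for illustration.
SAMPLE_QSTAT = """\
Job Id: 12345.example-server
    Job_Name = mpi_test
    resources_used.cput = 00:10:00
    resources_used.mem = 2097152kb
    resources_used.vmem = 4194304kb
"""

# Conversion factors from the unit suffix to kilobytes.
_UNITS_KB = {"b": 1 / 1024, "kb": 1, "mb": 1024, "gb": 1024 ** 2, "tb": 1024 ** 3}

# Matches a line such as "    resources_used.mem = 2097152kb".
_MEM_LINE = re.compile(
    r"^\s*resources_used\.mem\s*=\s*(\d+)\s*([kmgt]?b)\s*$",
    re.MULTILINE | re.IGNORECASE,
)


def resources_used_mem_kb(qstat_text):
    """Return resources_used.mem in kB, or None if the attribute is absent."""
    match = _MEM_LINE.search(qstat_text)
    if match is None:
        return None
    value = int(match.group(1))
    unit = match.group(2).lower()
    return value * _UNITS_KB[unit]
```

Running this against the output for the same job under OpenMPI 1.4.5 and under MVAPICH2 would make it easy to see whether one value is roughly the node count times the other.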
Overall, this is very problematic because we ask Moab/Torque to kill jobs that use more memory than they requested or were allocated. We use a qsub wrapper to define the memory for every job, to avoid node crashes and similar problems. Since resources_used.mem now reports the memory used on all the nodes (let's say 100 nodes), the sum is huge, far bigger than the memory on any individual node, and so the job is killed with a message saying that it has exceeded its allocated memory.
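To make the arithmetic concrete, here is a small Python sketch of the comparison. It is a simplified model and not how Torque or Moab actually enforce limits; the function name, the 2 GB per-node figure, and the 4 GB limit are all hypothetical.

```python
def job_would_be_killed(per_node_mem_gb, limit_gb, summed):
    """Simplified model of the limit check (not Torque's real logic).

    per_node_mem_gb: memory used on each node, with rank 0 first (assumed).
    limit_gb:        memory limit set for the job, e.g. by a qsub wrapper.
    summed:          True  -> reported value is the sum over all nodes
                     False -> reported value is the rank-0 node only
    """
    if summed:
        reported = sum(per_node_mem_gb)
    else:
        reported = per_node_mem_gb[0]
    return reported > limit_gb


# Hypothetical job: 100 nodes, each using 2 GB, with a 4 GB limit.
nodes = [2.0] * 100
limit = 4.0

killed_when_summed = job_would_be_killed(nodes, limit, summed=True)
killed_when_rank0_only = job_would_be_killed(nodes, limit, summed=False)
```

With these made-up numbers the summed value is 200 GB against a 4 GB limit, so the job is killed, while the rank-0 value of 2 GB stays under the limit.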
Has anyone seen this behavior on their clusters? Given that it works fine with MVAPICH2, I think it has to do with OpenMPI 1.4.5 (since it works fine with 1.4.3). We are testing 1.4.3 on our new clusters and plan to test 1.4.5 on our old clusters, but I thought it would be useful to ask whether anyone has any thoughts on it. Please let me know.