[Mauiusers] Task count greater than 4096

Steve Jones stevejones at stanford.edu
Mon Jun 21 12:49:38 MDT 2010


We've run into an issue submitting jobs with more than 4096 tasks on a Torque/Maui combination. The following submission (170 x 24 = 4080 tasks) runs:

$ qsub -I -lnodes=170:ppn=24

When we go larger by one node:

$ qsub -I -lnodes=171:ppn=24

The job is in the blocked queue with a state of Idle and the following message in checkjob:

cannot select job 104 for partition DEFAULT (NodeCount)
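
For reference, the task counts of the two requests fall on either side of 4096. This is just a quick shell arithmetic sketch: the task_count helper and the variable names are made up for illustration, and 4096 is the MAX_MTASK default discussed in this message.

```shell
# Hypothetical helper: tasks = nodes * processors per node
task_count() {
  echo $(( $1 * $2 ))
}

limit=4096   # default MAX_MTASK, per this message
ppn=24

for nodes in 170 171; do
  tasks=$(task_count "$nodes" "$ppn")
  if [ "$tasks" -le "$limit" ]; then
    echo "nodes=$nodes tasks=$tasks (within $limit)"
  else
    echo "nodes=$nodes tasks=$tasks (exceeds $limit)"
  fi
done
```

So the 170-node request asks for 4080 tasks and the 171-node request asks for 4104, which is consistent with a per-job task limit of 4096 being what blocks the larger job.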

I did some searching and found information about limits on the number of jobs, but not much on the number of tasks per job. I tried increasing MAX_MTASK from its default of 4096 to 16384 to cover the core count on our cluster. This works in that we're able to submit jobs with more than 4096 tasks, but Maui crashes within minutes once we start submitting jobs. These are the two parameters we're changing before rebuilding Maui:

sed -i '/MMAX_JOB/ s/4096/8192/g' ./include/msched.h 
sed -i '/MAX_MTASK/ s/4096/16384/g' ./include/msched-common.h
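
As a sanity check on the substitution itself, the same sed expression can be tried on a scratch file. This is only a sketch: the #define line below is an assumed stand-in for what the header contains, not copied from the Maui source, and -i is dropped so nothing is edited in place.

```shell
# Scratch file holding a hypothetical stand-in for the header line
tmp=$(mktemp)
printf '#define MAX_MTASK 4096\n' > "$tmp"

# Same expression as above, without -i, so the result goes to stdout
result=$(sed '/MAX_MTASK/ s/4096/16384/g' "$tmp")
echo "$result"

rm -f "$tmp"
```

Against the real tree, running grep -n MAX_MTASK ./include/msched-common.h after the sed shows whether the change actually took before rebuilding.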

The MMAX_JOB change is one we already have in the current build, and it doesn't have any adverse effect on Maui; the crashes only occur when we increase MAX_MTASK.

Is it possible we're missing another parameter change, or is this possibly a ulimit issue? Here are the ulimits on the system where Maui is running:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1196032
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1196032
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited



