[Mauiusers] Bug in "-l file=XXX" options?

T. Daniel Crawford crawdad at exchange.vt.edu
Fri Jul 1 13:58:20 MDT 2005


Thanks for your quick reply.  My nodes are packed with jobs right now, but I
rsh'ed to a few to check the limits.  The following is typical:

cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    10240 kbytes
coredumpsize 1024 kbytes
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  1024 
memorylocked 32 kbytes
maxproc      32760 

Note that jobs submitted with file=140000mb run perfectly when started
interactively (i.e., bypassing the batch queue), but under the batch system
the job does not appear to start on the node at all (no output appears, for
example).


On 7/1/05 3:52 PM, "Michael Musson" <musson at clusterresources.com> wrote:

> Hi Daniel,
> It's possible that this is a file-size limit imposed by the OS itself.
> Try the following:
> echo 'ulimit -a' | qsub -l nodes=mynode
> This should dump the following line (along with other info) to
> the STDIN.o<JobID> file in the user's home directory:
> "file size             (blocks, -f) <file size limit>"
> We suspect that <file size limit> is set on your system to some
> number smaller than 100mb, which would cause the behavior you're
> seeing.  You can try increasing the OS file size limit with:
> ulimit -f <new size>
> We've added a more verbose error message about the problem for the
> future. The following will be inserted into the mom log:
> "cannot set file limit to <limit> for job <X> (setrlimit failed - check
> default user limits)"
> We're also looking at ways of routing this message to end users.
> Thanks for the message and let us know if you have further problems.
> Mike M.
> On Fri, 2005-07-01 at 12:11 -0400, T. Daniel Crawford wrote:
>> Hi all,
>> We recently installed Torque (1.2.0p4) and Maui (3.2.6p13) on our research
>> group's clusters of Athlons, Xeons, and Opterons (all running FC2 or FC3).
>> The system has worked *great* so far, except for the apparent failure of the
>> file=XXX option.  Specifically, if a user gives, e.g., "-l file=140000mb" to
>> qsub, then Maui appears to select the correct subset of nodes, i.e., the
>> task will go only to a machine with sufficient scratch space as reported to
>> the pbs_server by the node's pbs_mom.  However, the job immediately dies
>> upon arrival:
>> 07/01/2005 12:04:34;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started,
>> Failure job exec failure, after files staged, no retry
>> 07/01/2005 12:04:34;0001;   pbs_mom;Job;456.sirius.<censored>;ALERT:  job
>> failed phase 3 start, server will retry
>> 07/01/2005 12:04:34;0008;   pbs_mom;Req;send_sisters;sending ABORT to
>> sisters
>> However, if I only request "-l file=10mb", the job runs fine.  (But "-l
>> file=100mb" also fails.)
>> Many of our calculations require large amounts of scratch disk space.  I'd
>> prefer to use the MINRESOURCE policy alone because of its dynamic
>> flexibility, but this bug has forced me to define node partitions, which
>> don't always provide the most balanced load across the cluster.
>> Any help the Maui/Torque gurus can provide would be greatly appreciated!
>> Thanks,
>> -Daniel
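If the nodes' default limit does turn out to be the culprit, one persistent fix is to raise the per-user file-size limit on each node rather than calling ulimit in every job script. A sketch, assuming the stock FC2/FC3 PAM setup with pam_limits (fsize values in limits.conf are in kbytes, and "unlimited" is accepted):

```shell
# Sketch (run as root on each node): raise the default per-user
# file-size limit so pbs_mom's setrlimit() call can succeed for
# large "-l file=" requests.  Assumes pam_limits reads this file.
cat >> /etc/security/limits.conf <<'EOF'
*    soft    fsize    unlimited
*    hard    fsize    unlimited
EOF
```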

T. Daniel Crawford                           Department of Chemistry
crawdad at vt.edu                                    Virginia Tech
www.chem.vt.edu/faculty/crawford.php  Voice: 540-231-7760  FAX: 540-231-3255
 PGP Public Key at: http://www.chem.vt.edu/chem-dept/crawford/publickey.txt
