[Mauiusers] Bug in "-l file=XXX" options?
T. Daniel Crawford
crawdad at exchange.vt.edu
Fri Jul 1 13:58:20 MDT 2005
Thanks for your quick reply. My nodes are packed with jobs right now, but I
rsh'ed to a few to check the limits. The following is typical:
stacksize       10240 kbytes
coredumpsize    1024 kbytes
memorylocked    32 kbytes
Note that jobs submitted with file=140000mb run perfectly when executed
interactively (i.e., bypassing the batch queue), but under the batch system
the job doesn't appear to start on the node at all (no output appears, for
example).
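A quick way to see the limit in question (a minimal sketch; it assumes a bash shell, where the "blocks" reported by ulimit are 1024-byte units):

```shell
# Show the per-process file-size limit that child processes (and hence
# batch jobs launched from this environment) will inherit.
bash -c "ulimit -a" | grep 'file size'
```

Running this interactively versus inside a batch job can reveal whether the two environments start from different limits.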
On 7/1/05 3:52 PM, "Michael Musson" <musson at clusterresources.com> wrote:
> Hi Daniel,
> It's possible that this is a file-size limit imposed by the OS itself.
> Try the following:
> echo 'ulimit -a' | qsub -l nodes=mynode
> This should dump the following line (along with other limits) to the
> STDIN.o<JobID> file in the user's home directory.
> "file size (blocks, -f) <file size limit>"
> We suspect that <file size limit> is set on your system to a number
> smaller than 100mb, which would cause the behavior you're seeing. You
> can try increasing the OS file-size limit with:
> ulimit -f <new size>
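The mechanism behind this suggestion can be demonstrated directly (a minimal sketch with a hypothetical limit of 2048 blocks): a soft file-size limit set in a parent shell is inherited by its children, which is why a low limit in the environment that starts pbs_mom would be passed down to every job it spawns.

```shell
# Set a soft file-size limit of 2048 blocks (2 MB in bash's 1024-byte
# units) in an outer shell, then show that an inner shell inherits it.
bash -c 'ulimit -S -f 2048; bash -c "ulimit -S -f"'
```

The inner shell reports 2048, the value set by its parent, even though it never called ulimit itself.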
> We've added a more verbose error message for this problem; in future
> versions the following will be inserted into the mom log:
> "cannot set file limit to <limit> for job <X> (setrlimit failed - check
> default user limits)"
> We're also looking at ways of routing this message to end users.
> Thanks for the message and let us know if you have further problems.
> Mike M.
> On Fri, 2005-07-01 at 12:11 -0400, T. Daniel Crawford wrote:
>> Hi all,
>> We recently installed Torque (1.2.0p4) and Maui (3.2.6p13) on our research
>> group's clusters of Athlons, Xeons, and Opterons (all running FC2 or FC3).
>> The system has worked *great* so far, except for the apparent failure of the
>> file=XXX option. Specifically, if a user gives, e.g., "-l file=140000mb" to
>> qsub, then Maui appears to select the correct subset of nodes, i.e., the
>> task will go only to a machine with sufficient scratch space as reported to
>> the pbs_server by the node's pbs_mom. However, the job immediately dies
>> upon arrival:
>> 07/01/2005 12:04:34;0001; pbs_mom;Job;TMomFinalizeJob3;job not started,
>> Failure job exec failure, after files staged, no retry
>> 07/01/2005 12:04:34;0001; pbs_mom;Job;456.sirius.<censored>;ALERT: job
>> failed phase 3 start, server will retry
>> 07/01/2005 12:04:34;0008; pbs_mom;Req;send_sisters;sending ABORT to
>> However, if I only request "-l file=10mb", the job runs fine. (But "-l
>> file=100mb" also fails.)
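For scale, the requested sizes can be converted into the 1024-byte blocks that ulimit -f uses (a hypothetical conversion, assuming PBS "mb" means 2^20 bytes), which makes it easy to compare a failing request against a reported limit:

```shell
# Convert qsub "-l file=" requests in mb to 1024-byte ulimit blocks.
for mb in 10 100 140000; do
  printf '%s mb = %s blocks\n' "$mb" $(( mb * 1024 ))
done
```

If the node's file-size limit fell between 10240 and 102400 blocks, it would explain exactly the 10mb-succeeds/100mb-fails split described above.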
>> Many of our calculations require large amounts of scratch disk space. I'd
>> prefer to rely on the MINRESOURCE policy alone because of its dynamic
>> flexibility, but this bug has forced me to define partitions of nodes, which
>> doesn't always provide the most balanced load across the cluster.
>> Any help the Maui/Torque gurus can provide would be greatly appreciated!
T. Daniel Crawford Department of Chemistry
crawdad at vt.edu Virginia Tech
www.chem.vt.edu/faculty/crawford.php Voice: 540-231-7760 FAX: 540-231-3255
PGP Public Key at: http://www.chem.vt.edu/chem-dept/crawford/publickey.txt