[Mauiusers] Bug in "-l file=XXX" options?

Michael Musson musson at clusterresources.com
Fri Jul 1 13:52:58 MDT 2005


Hi Daniel,

It's possible that this is a file-size limit imposed by the OS itself.
Try the following:

echo 'ulimit -a' | qsub -l nodes=mynode

This should dump the following line (along with other info) to
the STDIN.o<JobID> file in the user's home directory:

"file size             (blocks, -f) <file size limit>"

We suspect that <file size limit> is set on your system to some
number smaller than 100mb, which would cause the behavior you're
seeing.  You can try increasing the OS file size limit with:

ulimit -f <new size>
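
To compare the reported limit against the size you are requesting, something
like the sketch below may help. It assumes bash, where "ulimit -f" reports
1024-byte blocks (a strict POSIX sh reports 512-byte units, so pass 512 as the
block size there). The function names limit_to_mb and fits_limit are made up
for this example and are not part of Torque or Maui.

```shell
#!/bin/bash
# Convert a `ulimit -f` value to megabytes.
# Assumption: the limit is given in 1024-byte blocks unless a block size is passed.
limit_to_mb() {
    local limit=$1 block_bytes=${2:-1024}
    if [ "$limit" = "unlimited" ]; then
        echo "unlimited"
    else
        echo $(( limit * block_bytes / 1024 / 1024 ))
    fi
}

# Succeed if a request of $1 MB fits under the limit $2 (as printed by ulimit -f).
fits_limit() {
    local request_mb=$1 limit=$2 block_bytes=${3:-1024}
    [ "$limit" = "unlimited" ] && return 0
    [ $(( request_mb * 1024 * 1024 )) -le $(( limit * block_bytes )) ]
}

current=$(ulimit -f)
echo "file size limit: $(limit_to_mb "$current") MB"
if fits_limit 100 "$current"; then
    echo "a 100 MB request fits under the current limit"
else
    echo "a 100 MB request exceeds the current limit"
fi
```

Running this inside a job (submitted through qsub as above) rather than in a
login shell shows the limit the job actually inherits, which may differ.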

We've added a more verbose error message about the problem for the
future. The following will be inserted into the mom log:

"cannot set file limit to <limit> for job <X> (setrlimit failed - check
default user limits)"
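
One common reason for setrlimit to fail is an unprivileged process asking for
a limit above its hard limit. We have not confirmed that this is what happens
in your case, but the general behavior can be reproduced with the sketch
below. It runs in a subshell so your own shell's limits are left alone; the
function name try_raise_limit and the values 1024 and 2048 are just
illustrative.

```shell
# Sketch: try to raise the file size limit above a lowered hard limit.
# Everything runs in a subshell so the calling shell's limits are untouched.
try_raise_limit() {
    (
        # Without -S or -H, ulimit lowers the soft and hard limits together.
        ulimit -f 1024 2>/dev/null || exit 2
        if ulimit -f 2048 2>/dev/null; then
            echo "raised"     # a privileged process (e.g. root) may do this
        else
            echo "refused"    # unprivileged: cannot exceed the hard limit
        fi
    )
}

try_raise_limit
```

For an ordinary user this normally prints "refused", which is the same kind
of failure the new mom log message reports.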

We're also looking at ways of routing this message to end users.

Thanks for the message and let us know if you have further problems.

Mike M.



On Fri, 2005-07-01 at 12:11 -0400, T. Daniel Crawford wrote:
> Hi all,
> 
> We recently installed Torque (1.2.0p4) and Maui (3.2.6p13) on our research
> group's clusters of Athlons, Xeons, and Opterons (all running FC2 or FC3).
> The system has worked *great* so far, except for the apparent failure of the
> file=XXX option.  Specifically, if a user gives, e.g., "-l file=140000mb" to
> qsub, then Maui appears to select the correct subset of nodes, i.e., the
> task will go only to a machine with sufficient scratch space as reported to
> the pbs_server by the node's pbs_mom.  However, the job immediately dies
> upon arrival:
> 
> 07/01/2005 12:04:34;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started,
> Failure job exec failure, after files staged, no retry
> 07/01/2005 12:04:34;0001;   pbs_mom;Job;456.sirius.<censored>;ALERT:  job
> failed phase 3 start, server will retry
> 07/01/2005 12:04:34;0008;   pbs_mom;Req;send_sisters;sending ABORT to
> sisters
> 
> However, if I only request "-l file=10mb", the job runs fine.  (But "-l
> file=100mb" also fails.)
> 
> Many of our calculations require large amounts of scratch disk space.  I'd
> prefer to use the MINRESOURCE policy only because of its dynamic
> flexibility, but this bug has forced me to define partitions of nodes, which
> doesn't always provide the most balanced load across the cluster.
> 
> Any help the Maui/Torque gurus can provide would be greatly appreciated!
> 
> Thanks,
> 
> -Daniel
> 


