[Mauiusers] Bug in "-l file=XXX" options?

T. Daniel Crawford crawdad at exchange.vt.edu
Fri Jul 1 10:11:17 MDT 2005


Hi all,

We recently installed Torque (1.2.0p4) and Maui (3.2.6p13) on our research
group's clusters of Athlons, Xeons, and Opterons (all running FC2 or FC3).
The system has worked *great* so far, except for the apparent failure of the
file=XXX option.  Specifically, if a user gives, e.g., "-l file=140000mb" to
qsub, then Maui appears to select the correct subset of nodes, i.e., the
task will go only to a machine with sufficient scratch space as reported to
the pbs_server by the node's pbs_mom.  However, the job immediately dies
upon arrival:

07/01/2005 12:04:34;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec failure, after files staged, no retry
07/01/2005 12:04:34;0001;   pbs_mom;Job;456.sirius.<censored>;ALERT:  job failed phase 3 start, server will retry
07/01/2005 12:04:34;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters

However, if I only request "-l file=10mb", the job runs fine.  (But "-l
file=100mb" also fails.)
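
For anyone who wants to try reproducing this, the following is a minimal
sketch of how the sizes above could be probed.  It is a dry run that only
prints the qsub commands rather than submitting anything.  The job script
name "test.sh" and the helper name "probe_cmds" are placeholders I made up
for illustration, not anything from our actual setup.

```shell
#!/bin/sh
# Dry-run sketch: print one qsub command per requested file= size.
# "test.sh" is a hypothetical trivial job script (e.g. one that runs "hostname").
probe_cmds() {
    for size in "$@"; do
        echo "qsub -l file=$size test.sh"
    done
}

# Sizes taken from the report: 10mb ran fine; 100mb and 140000mb failed.
probe_cmds 10mb 100mb 140000mb
```

Piping the output to "sh" on a submit host would actually submit the jobs;
the pbs_mom log on the execution node then shows which sizes fail at start.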

Many of our calculations require large amounts of scratch disk space.  I'd
prefer to rely on the MINRESOURCE policy alone because of its dynamic
flexibility, but this bug has forced me to define partitions of nodes, which
doesn't always balance the load well across the cluster.

Any help the Maui/Torque gurus can provide would be greatly appreciated!

Thanks,

-Daniel

-- 
T. Daniel Crawford                           Department of Chemistry
crawdad at vt.edu                                    Virginia Tech
www.chem.vt.edu/faculty/crawford.php  Voice: 540-231-7760  FAX: 540-231-3255
                            --------------------
 PGP Public Key at: http://www.chem.vt.edu/chem-dept/crawford/publickey.txt


