[torqueusers] Job stdout/stderr file empty after transfer
Jan.Ploski at offis.de
Tue Apr 17 01:07:40 MDT 2007
torqueusers-bounces at supercluster.org wrote on 04/17/2007 01:46:08 AM:
> On Mon, Apr 16, 2007 at 06:25:14PM +0200, Jan Ploski alleged:
> > torqueusers-bounces at supercluster.org wrote on 04/13/2007 03:22:45
> > > Hello,
> > >
> > > I am using TORQUE 2.1.6, trying to transfer stdout of a job using
> > > option of qsub. Unfortunately, no matter whether I transfer via scp
> > set
> > > up $usecp, the transferred file is created with size 0 (zero). When
> > use
> > > the option "-k oe" instead, the file remains in $HOME on the execute
> > > machine and contains the expected output. Can anyone please explain
> > > or give a tip which log file to inspect or what experiments to
> > to
> > > gather more information?
> > Solved. The disk with /var/spool/torque on the execute machine was full.
> > I'd classify it as an error handling bug in TORQUE. We had to strace
> > child process to debug it - shouldn't be necessary.
> I've wrestled with this myself on my own cluster. It is not uncommon
> for users to fill up /var with too much stdout/stderr, fail to have the
> huge file copy to home, and then stick around in undelivered.
> Clearly, the most correct thing would be to kill the job if writes to
> spool files fail, but users don't necessarily consider this to be a
> fatal condition for their job.
> On my own cluster, some users are just sending debug info to
> stdout/stderr, and their *real* output is going to a different file, in
> this case the users prefer that the system do everything possible to
> keep the job running.
I agree with you that error handling decisions might depend on context
- they often do, in general. One way to deal with it would be to let
the user supply or choose their own error handler. However,
my main wish would be for the "cannot write to disk" condition to be logged
prominently somewhere. This should be easy to implement and would delegate the
"handling" to an external agent (the user or some log monitoring
software). Perhaps the real current weakness is the lack of communication of
errors from pbs_mom's child process to its parent?
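
To illustrate what I mean, here is a minimal sketch in Python. It is not
TORQUE code and does not reflect how pbs_mom is actually implemented. The
names write_spool, read_reports and run_job, the pipe message format and the
"SPOOL WRITE FAILED" log text are all hypothetical. The child reports a
failed spool write to its parent through a pipe, and the parent logs it
where an admin or a log monitor can see it.

```python
import errno
import os
import sys


def write_spool(path, data, report_fd):
    """Child side: append data to the spool file.

    On failure, send "<errno> <path>" to the parent through report_fd
    instead of failing silently. Returns True on success.
    """
    try:
        with open(path, "ab") as spool:
            spool.write(data)
            # Flush and fsync so that buffered write errors (for example
            # a full disk) are raised inside this try block.
            spool.flush()
            os.fsync(spool.fileno())
        return True
    except OSError as exc:
        code = exc.errno or errno.EIO  # fall back to EIO if errno is unset
        os.write(report_fd, ("%d %s\n" % (code, path)).encode())
        return False


def read_reports(report_fd, log=sys.stderr):
    """Parent side: read reports until EOF and log each one prominently."""
    failures = []
    with os.fdopen(report_fd, "r") as reports:
        for line in reports:
            code, _, path = line.rstrip("\n").partition(" ")
            failures.append((int(code), path))
            log.write("SPOOL WRITE FAILED: %s: %s\n"
                      % (path, os.strerror(int(code))))
    return failures


def run_job(path, data):
    """Fork a child that writes the spool file; collect its error reports."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: keep only the write end of the pipe.
        os.close(read_fd)
        ok = write_spool(path, data, write_fd)
        os._exit(0 if ok else 1)
    # Parent: close the write end so reading stops when the child exits.
    os.close(write_fd)
    failures = read_reports(read_fd)
    _, status = os.waitpid(pid, 0)
    return failures, status
```

Calling run_job() with a path on a full or missing spool directory yields a
non-empty failure list and one log line per failure. What to do next (kill
the job, keep it running, notify the user) is then a policy decision made
outside the child.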
Best regards -
Dipl.-Inform. (FH) Jan Ploski
Escherweg 2 - 26121 Oldenburg - Germany
Fon: +49 441 9722 - 184 Fax: +49 441 9722 - 202
E-Mail: Jan.Ploski at offis.de - URL: http://www.offis.de