[torqueusers] Job stdout/stderr file empty after transfer

Jan Ploski Jan.Ploski at offis.de
Tue Apr 17 01:07:40 MDT 2007


torqueusers-bounces at supercluster.org wrote on 04/17/2007 01:46:08 AM:

> On Mon, Apr 16, 2007 at 06:25:14PM +0200, Jan Ploski alleged:
> > torqueusers-bounces at supercluster.org wrote on 04/13/2007 03:22:45 PM:
> > 
> > > Hello,
> > > 
> > > I am using TORQUE 2.1.6, trying to transfer stdout of a job using the -o
> > > option of qsub. Unfortunately, no matter whether I transfer via scp or set
> > > up $usecp, the transferred file is created with size 0 (zero). When I use
> > > the option "-k oe" instead, the file remains in $HOME on the execute
> > > machine and contains the expected output. Can anyone please explain this
> > > or give a tip which log file to inspect or what experiments to perform to
> > > gather more information?
> > 
> > Solved. The disk with /var/spool/torque on the execute machine was full.
> > 
> > I'd classify it as an error handling bug in TORQUE. We had to strace the
> > child process to debug it - shouldn't be necessary.
> 
> I've wrestled with this myself on my own cluster.  It is not uncommon
> for users to fill up /var with too much stdout/stderr, fail to have the
> huge file copy to home, and then stick around in undelivered.
> 
> Clearly, the most correct thing would be to kill the job if writes to
> spool files fail, but users don't necessarily consider this to be a
> fatal condition for their job. 
> 
> On my own cluster, some users are just sending debug info to
> stdout/stderr, and their *real* output is going to a different file. In
> this case the users prefer that the system do everything possible to
> keep the job running.

Garrick,

I agree with you that the error handling decision might depend on context,
as such decisions often do. One way to deal with it would be to let the
user supply or choose an error handler. However, my main wish would be for
the "cannot write to disk" condition to be logged prominently somewhere.
This should be easy to implement and would delegate the "handling" to an
external agent (the user or some log monitoring software). Perhaps the real
current weakness is the lack of communication of errors from pbs_mom's
child process to its parent?
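
To illustrate the kind of behaviour I mean, here is a rough Python sketch.
It is not TORQUE code: the function name write_spool and the logger name
are made up for this example, and a real fix would have to live in
pbs_mom's C sources. The point is only that a failed write is reported
back to the caller and logged, rather than silently producing an empty
file.

```python
import errno
import logging
import os

# Hypothetical logger name, chosen for this example only.
log = logging.getLogger("spool_writer")


def write_spool(path, data):
    """Append bytes to a spool file.

    Returns True on success and False on failure. A failure is logged
    at ERROR level so that a person or a log monitor can react to it.
    """
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        try:
            # os.write may write fewer bytes than requested, so loop.
            offset = 0
            while offset < len(data):
                offset += os.write(fd, data[offset:])
        finally:
            os.close(fd)
        return True
    except OSError as exc:
        # ENOSPC is the "no space left on device" case from this thread.
        reason = "disk full" if exc.errno == errno.ENOSPC else str(exc)
        log.error("cannot write job output to %s: %s", path, reason)
        return False
```

The caller (the parent process, in pbs_mom's terms) can then decide what
to do with a False result: kill the job, keep it running, or just notify
the user. That decision stays outside the code that does the writing.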

Best regards -
Jan Ploski

--
Dipl.-Inform. (FH) Jan Ploski
OFFIS
Betriebliches Informationsmanagement
Escherweg 2  - 26121 Oldenburg - Germany
Fon: +49 441 9722 - 184 Fax: +49 441 9722 - 202
E-Mail: Jan.Ploski at offis.de - URL: http://www.offis.de

