[torqueusers] Job stdout/stderr file empty after transfer
garrick at clusterresources.com
Mon Apr 16 17:46:08 MDT 2007
On Mon, Apr 16, 2007 at 06:25:14PM +0200, Jan Ploski alleged:
> torqueusers-bounces at supercluster.org wrote on 04/13/2007 03:22:45 PM:
> > Hello,
> > I am using TORQUE 2.1.6, trying to transfer stdout of a job using the -o
> > option of qsub. Unfortunately, no matter whether I transfer via scp or set
> > up $usecp, the transferred file is created with size 0 (zero). When I use
> > the option "-k oe" instead, the file remains in $HOME on the execute
> > machine and contains the expected output. Can anyone please explain this
> > or give a tip which log file to inspect or what experiments to perform to
> > gather more information?
> Solved. The disk with /var/spool/torque on the execute machine was full.
> I'd classify it as an error handling bug in TORQUE. We had to strace the
> child process to debug it - shouldn't be necessary.
I've wrestled with this myself on my own cluster. It is not uncommon
for users to fill up /var with too much stdout/stderr. The huge file
then fails to copy to home and sticks around in undelivered.
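
As a rough illustration of how a full spool disk could be spotted before
it bites, here is a minimal sketch that reports free space on the
filesystem holding the spool directory. It is not part of TORQUE; the
function name spool_usage and the 5% threshold are assumptions made for
this example, and the path is the one mentioned above.

```python
import os
import shutil


def spool_usage(path="/var/spool/torque", min_free_fraction=0.05):
    """Return (free_bytes, total_bytes, is_low) for the filesystem holding path.

    min_free_fraction is an arbitrary example threshold, not a TORQUE setting.
    """
    usage = shutil.disk_usage(path)
    # Guard against a zero-sized filesystem to avoid dividing by zero.
    free_fraction = usage.free / usage.total if usage.total else 0.0
    return usage.free, usage.total, free_fraction < min_free_fraction


if __name__ == "__main__":
    spool = "/var/spool/torque"
    if not os.path.isdir(spool):
        # Fall back to the root filesystem so the sketch runs anywhere.
        spool = "/"
    free, total, low = spool_usage(spool)
    status = "LOW" if low else "ok"
    print(f"{spool}: {free} of {total} bytes free ({status})")
```

Run from cron on each execute node, something like this would at least
flag the condition early instead of leaving it to be found with strace.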
Clearly, the most correct thing would be to kill the job if writes to
spool files fail, but users don't necessarily consider this to be a
fatal condition for their job.
On my own cluster, some users are just sending debug info to
stdout/stderr, and their *real* output is going to a different file. In
this case the users prefer that the system do everything possible to
keep the job running.