[torqueusers] Job stdout/stderr file empty after transfer
David Singleton
David.Singleton at anu.edu.au
Mon Apr 16 22:44:04 MDT 2007
This is the essence of a crude bit of code we stuck in
mom_over_limit() to clobber jobs writing too much to
/var for some random value of "too much".
David
/* This should be a mom config option */
#define CHECKVAR

#if !defined(NO_SPOOL_OUTPUT) && defined(CHECKVAR)
#define VARSPOOLUSERLIM_KB 20480   /* per-file limit: 20 MB */

  /* check file sizes in PBS spool area */
  if (pjob->ji_qs.ji_svrflags & JOB_SVFLG_HERE) {  /* only on MS (mother superior) */
    char path[MAXPATHLEN];
    char *suf;
    struct stat sbuf;

    /* build <path_spool><fileprefix>; suf marks where the suffix starts */
    (void)strcpy(path, path_spool);
    (void)strcat(path, pjob->ji_qs.ji_fileprefix);
    suf = path + strlen(path);

    /* stdout spool file */
    (void)strcat(path, JOB_STDOUT_SUFFIX);
    if ( (stat(path, &sbuf) == 0) &&
         (sbuf.st_size>>10 > (off_t)VARSPOOLUSERLIM_KB) ) {
      sprintf(log_buffer, "stdout file size %luKB exceeds limit %luKB",
              ((unsigned long)(sbuf.st_size>>10)), (unsigned long)VARSPOOLUSERLIM_KB);
      return (JOB_SVFLG_OVERLMT);
    }

    /* stderr spool file: overwrite the suffix in place */
    (void)strcpy(suf, JOB_STDERR_SUFFIX);
    if ( (stat(path, &sbuf) == 0) &&
         (sbuf.st_size>>10 > (off_t)VARSPOOLUSERLIM_KB) ) {
      sprintf(log_buffer, "stderr file size %luKB exceeds limit %luKB",
              ((unsigned long)(sbuf.st_size>>10)), (unsigned long)VARSPOOLUSERLIM_KB);
      return (JOB_SVFLG_OVERLMT);
    }
  }
#endif
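
For experimenting outside of pbs_mom, the same size test can be pulled
out into a small standalone function. This is only a sketch under
assumptions: spool_file_over_limit() and its parameters are hypothetical
names, not part of TORQUE, and it uses snprintf() instead of sprintf() so
the message cannot overrun the caller's buffer.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper, not part of TORQUE.
 * Returns 1 if the file at `path` is larger than `limit_kb` kilobytes,
 * and 0 if it is within the limit or cannot be stat'ed (for example
 * because it does not exist yet). On an over-limit result a message
 * in the style of the snippet above is written into `msg`. */
static int spool_file_over_limit(const char *path, const char *label,
                                 unsigned long limit_kb,
                                 char *msg, size_t msglen)
{
    struct stat sbuf;
    unsigned long size_kb;

    if (stat(path, &sbuf) != 0)
        return 0;

    /* bytes -> KB, rounded down, as in the original check */
    size_kb = (unsigned long)(sbuf.st_size >> 10);
    if (size_kb <= limit_kb)
        return 0;

    snprintf(msg, msglen, "%s file size %luKB exceeds limit %luKB",
             label, size_kb, limit_kb);
    return 1;
}
```

A caller would invoke it once for the stdout spool file and once for the
stderr spool file, with 20480 as the limit to match VARSPOOLUSERLIM_KB.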
Garrick Staples wrote:
> On Mon, Apr 16, 2007 at 06:25:14PM +0200, Jan Ploski alleged:
>> torqueusers-bounces at supercluster.org wrote on 04/13/2007 03:22:45 PM:
>>
>>> Hello,
>>>
>>> I am using TORQUE 2.1.6, trying to transfer stdout of a job using the -o
>>> option of qsub. Unfortunately, no matter whether I transfer via scp or set
>>> up $usecp, the transferred file is created with size 0 (zero). When I use
>>> the option "-k oe" instead, the file remains in $HOME on the execute
>>> machine and contains the expected output. Can anyone please explain this
>>> or give a tip which log file to inspect or what experiments to perform to
>>> gather more information?
>> Solved. The disk with /var/spool/torque on the execute machine was full.
>>
>> I'd classify it as an error handling bug in TORQUE. We had to strace the
>> child process to debug it - shouldn't be necessary.
>
> I've wrestled with this myself on my own cluster. It is not uncommon
> for users to fill up /var with too much stdout/stderr, have the huge
> file fail to copy to home, and then have it stick around in undelivered.
>
> Clearly, the most correct thing would be to kill the job if writes to
> spool files fail, but users don't necessarily consider this to be a
> fatal condition for their job.
>
> On my own cluster, some users are just sending debug info to
> stdout/stderr, and their *real* output is going to a different file. In
> this case, the users prefer that the system do everything possible to
> keep the job running.