[torqueusers] Job stdout/stderr file empty after transfer

David Singleton David.Singleton at anu.edu.au
Mon Apr 16 22:44:04 MDT 2007



This is the essence of a crude bit of code we stuck in
mom_over_limit() to clobber jobs that write too much to
/var, for some arbitrary value of "too much".

David

/* This should be a mom config option */
#define CHECKVAR

#if !defined(NO_SPOOL_OUTPUT) && defined(CHECKVAR)
#define VARSPOOLUSERLIM_KB 20480

         /* check output file sizes in the PBS spool area */
         if (pjob->ji_qs.ji_svrflags & JOB_SVFLG_HERE) { /* only on Mother Superior */
                 char path[MAXPATHLEN];
                 char *suf;
                 struct stat sbuf;

                 /* build <path_spool><job fileprefix>; suf marks where the
                  * stdout/stderr suffix gets appended */
                 (void)strcpy(path, path_spool);
                 (void)strcat(path, pjob->ji_qs.ji_fileprefix);
                 suf = path + strlen(path);

                 (void)strcpy(suf, JOB_STDOUT_SUFFIX);
                 if ((stat(path, &sbuf) == 0) &&
                     (sbuf.st_size >> 10 > (off_t)VARSPOOLUSERLIM_KB)) {
                         sprintf(log_buffer, "stdout file size %luKB exceeds limit %luKB",
                                 (unsigned long)(sbuf.st_size >> 10), (unsigned long)VARSPOOLUSERLIM_KB);
                         return (JOB_SVFLG_OVERLMT);  /* nonzero: job is over limit */
                 }

                 (void)strcpy(suf, JOB_STDERR_SUFFIX);
                 if ((stat(path, &sbuf) == 0) &&
                     (sbuf.st_size >> 10 > (off_t)VARSPOOLUSERLIM_KB)) {
                         sprintf(log_buffer, "stderr file size %luKB exceeds limit %luKB",
                                 (unsigned long)(sbuf.st_size >> 10), (unsigned long)VARSPOOLUSERLIM_KB);
                         return (JOB_SVFLG_OVERLMT);  /* nonzero: job is over limit */
                 }
         }
#endif
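
As the comment above says, the hard-coded limit really belongs in mom's
config file. A minimal sketch of one way to wire that up, assuming the
usual pattern of a config-keyword/handler table in mom_main.c (the
$varspooluserlim keyword, handler name, and table entry below are
hypothetical illustrations, not actual TORQUE source):

/* Hypothetical mom config handler: replaces the compile-time #define
 * with a limit settable via "$varspooluserlim <KB>" in mom's config.
 * Needs <stdlib.h> for strtoul(). */
static unsigned long varspooluserlim_kb = 20480;  /* default limit, in KB */

static unsigned long
setvarspooluserlim(char *value)
{
        char *end;
        unsigned long kb = strtoul(value, &end, 10);

        if (end == value || kb == 0)
                return (0);  /* reject malformed or zero limits */

        varspooluserlim_kb = kb;
        return (1);  /* accepted */
}

The over-limit test would then compare against varspooluserlim_kb
instead of VARSPOOLUSERLIM_KB, with the handler registered in mom's
config-keyword table.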

Garrick Staples wrote:
> On Mon, Apr 16, 2007 at 06:25:14PM +0200, Jan Ploski alleged:
>> torqueusers-bounces at supercluster.org wrote on 04/13/2007 03:22:45 PM:
>>
>>> Hello,
>>>
>>> I am using TORQUE 2.1.6, trying to transfer stdout of a job using the -o
>>> option of qsub. Unfortunately, no matter whether I transfer via scp or
>>> set up $usecp, the transferred file is created with size 0 (zero). When
>>> I use the option "-k oe" instead, the file remains in $HOME on the
>>> execute machine and contains the expected output. Can anyone please
>>> explain this or give a tip which log file to inspect or what
>>> experiments to perform to gather more information?
>> Solved. The disk with /var/spool/torque on the execute machine was full.
>>
>> I'd classify it as an error-handling bug in TORQUE. We had to strace the
>> child process to debug it; that shouldn't be necessary.
> 
> I've wrestled with this myself on my own cluster.  It is not uncommon
> for users to fill up /var with too much stdout/stderr; the huge files
> then fail to copy home and stick around in the undelivered directory.
> 
> Clearly, the most correct thing would be to kill the job if writes to
> spool files fail, but users don't necessarily consider this to be a
> fatal condition for their job.  
> 
> On my own cluster, some users are just sending debug info to
> stdout/stderr, and their *real* output is going to a different file, in
> this case the users prefer that the system do everything possible to
> keep the job running.
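
Re Jan's diagnosis above: a quick experiment that avoids strace entirely
is to check free space in the spool filesystem with statvfs(3). A minimal
standalone sketch (the spool path below assumes a default $PBS_HOME
layout; adjust for your install):

#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
        struct statvfs vfs;
        const char *spool = "/var/spool/torque/spool";  /* adjust to your install */

        if (statvfs(spool, &vfs) != 0) {
                perror("statvfs");
                return 1;
        }

        /* f_bavail: blocks available to unprivileged users */
        printf("%s: %llu KB free\n", spool,
               (unsigned long long)vfs.f_bavail * vfs.f_frsize / 1024);
        return 0;
}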


