[torqueusers] Limit size of standard output / standard error

Chad Vizino vizino at psc.edu
Mon Dec 15 09:12:13 MST 2008


Hi David,

David Singleton addressed this at his site in OpenPBS (see his post 
below).  We've adapted his change for Torque at our site and it works 
great.  Both pieces go in src/resmom/linux/mom_mach.c:

Put this with the other externs:

extern  char    *path_spool;

And this goes in mom_over_limit():

#if NO_SPOOL_OUTPUT == 0
#define VARSPOOLUSERLIM_KB 20480

   /* check file sizes in PBS spool area */
   if (pjob->ji_qs.ji_svrflags&JOB_SVFLG_HERE) { /* only on MS */
     char path[64];
     char *suf;
     struct stat sbuf;

     (void)strcpy(path, path_spool);
     (void)strcat(path, pjob->ji_qs.ji_fileprefix);
     suf = path+strlen(path);

     (void)strcat(path, JOB_STDOUT_SUFFIX);
     if ( (stat(path, &sbuf)==0) &&
       (sbuf.st_size>>10 > (off_t)VARSPOOLUSERLIM_KB) ){
       sprintf(log_buffer, "stdout file size %luKB exceeded limit %luKB",
       ((unsigned long)(sbuf.st_size>>10)), (unsigned long)VARSPOOLUSERLIM_KB);
       return (TRUE);
     }

     (void)strcpy(suf, JOB_STDERR_SUFFIX);
     if ( (stat(path, &sbuf)==0) &&
       (sbuf.st_size>>10 > (off_t)VARSPOOLUSERLIM_KB) ){
       sprintf(log_buffer, "stderr file size %luKB exceeded limit %luKB",
       ((unsigned long)(sbuf.st_size>>10)), (unsigned long)VARSPOOLUSERLIM_KB);
       return (TRUE);
     }
   }
#endif
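
One caveat: path above is a fixed 64-byte buffer filled with strcpy/strcat, 
so a long spool directory or file prefix could overrun it.  Below is a 
minimal standalone sketch of the same size check using snprintf instead.  
The helper name spool_file_over_limit, its parameters, and the 1024-byte 
buffer are assumptions for illustration only; none of them exist in Torque.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical helper (not a Torque function).
 * Returns 1 if the file <dir><prefix><suffix> exists and its size in KB
 * is greater than limit_kb; returns 0 otherwise. When it returns 1, a
 * description is written to msg (at most msg_len bytes, always terminated).
 */
static int spool_file_over_limit(const char *dir, const char *prefix,
                                 const char *suffix, unsigned long limit_kb,
                                 char *msg, size_t msg_len)
{
    char path[1024];               /* assumed size, not taken from Torque */
    struct stat sbuf;
    unsigned long long size_kb;
    int n;

    /* NULL arguments are a caller bug, not a runtime condition */
    assert(dir != NULL && prefix != NULL && suffix != NULL && msg != NULL);

    /* snprintf never writes past the buffer; skip the check if truncated */
    n = snprintf(path, sizeof(path), "%s%s%s", dir, prefix, suffix);
    if (n < 0 || (size_t)n >= sizeof(path))
        return 0;

    /* no spool file (yet) means there is nothing to check */
    if (stat(path, &sbuf) != 0)
        return 0;

    /* same comparison as the patch: size in whole KB against the limit */
    size_kb = ((unsigned long long)sbuf.st_size) >> 10;
    if (size_kb <= limit_kb)
        return 0;

    snprintf(msg, msg_len, "spool file %s size %lluKB exceeded limit %luKB",
             path, size_kb, limit_kb);
    return 1;
}
```

In the patch this would be called once with JOB_STDOUT_SUFFIX and once with 
JOB_STDERR_SUFFIX, passing path_spool, pjob->ji_qs.ji_fileprefix, 
VARSPOOLUSERLIM_KB and log_buffer, and returning TRUE when it reports 1.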


Regards,
   -Chad

Chad Vizino
Pittsburgh Supercomputing Center

> Subject: Re: [torqueusers] User's job can mess up the system so that no jobs run
> Date: Fri, 07 Sep 2007 21:44:14 +1000
> From: David Singleton <David.Singleton at anu.edu.au>
> Reply-To: David.Singleton at anu.edu.au
> Organization: ANUSF
> To: Atwood, Robert C <r.atwood at imperial.ac.uk>
> CC: torqueusers at supercluster.org
> 
> 
> This is a hacky bit of code we have at the end of mom_over_limit()
> in our PBS - it kills jobs when spooled stdout or stderr reach 20MB
> (who will ever read 20MB of text!).  It would need modifying for
> Torque.
> 
> David
> 
> /* This should be a mom config option */
> #define CHECKVAR
> 
> #if !defined(NO_SPOOL_OUTPUT) && defined(CHECKVAR)
> #define VARSPOOLUSERLIM_KB 20480
> 
>         /* check file sizes in PBS spool area */
>         if (pjob->ji_qs.ji_svrflags&JOB_SVFLG_HERE) { // only on MS
>                 char path[64];
>                 char *suf;
>                 struct stat sbuf;
> 
>                 (void)strcpy(path, path_spool);
>                 (void)strcat(path, pjob->ji_qs.ji_fileprefix);
>                 suf = path+strlen(path);
> 
>                 (void)strcat(path, JOB_STDOUT_SUFFIX);
>                 if ( (stat(path, &sbuf)==0) &&
>                      (sbuf.st_size>>10 > (off_t)VARSPOOLUSERLIM_KB) ){
>                         sprintf(log_buffer, "stdout file size %luKB exceeds limit %luKB",
>                                 ((unsigned long)(sbuf.st_size>>10)), (unsigned long)VARSPOOLUSERLIM_KB);
>                         return (JOB_SVFLG_OVERLMT2|JOB_SVFLG_OVERLMTFILE);
>                 }
> 
>                 (void)strcpy(suf, JOB_STDERR_SUFFIX);
>                 if ( (stat(path, &sbuf)==0) &&
>                      (sbuf.st_size>>10 > (off_t)VARSPOOLUSERLIM_KB) ){
>                         sprintf(log_buffer, "stderr file size %luKB exceeds limit %luKB",
>                                 ((unsigned long)(sbuf.st_size>>10)), (unsigned long)VARSPOOLUSERLIM_KB);
>                         return (JOB_SVFLG_OVERLMT2|JOB_SVFLG_OVERLMTFILE);
>                 }
>         }
> #endif
> 
> ...
> --------------------------------------------------------------------------
>    Dr David Singleton               ANU Supercomputer Facility
>    HPC Systems Manager              and APAC National Facility
>    David.Singleton at anu.edu.au       Leonard Huxley Bldg (No. 56)
>    Phone: +61 2 6125 4389           Australian National University
>    Fax:   +61 2 6125 8199           Canberra, ACT, 0200, Australia
> --------------------------------------------------------------------------



On 12/11/08 9:27 PM, David Schibeci wrote:
> This is probably a lame question, but I've done a google search and 
> can't find an answer.
> 
> Is there a way to get torque to limit the size of standard output/error? 
> And kill the job if it exceeds this limit?
> 
> We have diskless nodes, and if standard output/error gets too big, then 
> the machine runs out of RAM.
> 
> Thanks in advance,
> David
> 
> ------------------------------------------------------------------------------ 
> 
> David Schibeci
> Senior Systems Administrator
> iVEC Informatics Facility
> Centre for Comparative Genomics
> Murdoch University
> South Street
> Murdoch WA 6150
> 
> Phone: 61 8 9360 2492
> Fax: 61 8 9360 7238
> E-Mail: dschibeci at ccg.murdoch.edu.au

