[torqueusers] limiting stdout/stderr files

Chad Vizino vizino at psc.edu
Wed Dec 15 12:24:48 MST 2010


We've hacked David's "hacky bit of code" (see below) to work with Torque 
at our site and find it quite useful.  Thanks Dave!

Here's our version, currently in use with Torque 2.3.12:

#if !defined(NO_SPOOL_OUTPUT)
#define VARSPOOLUSERLIM_KB 20480

   /* check file sizes in PBS spool area */
   if (pjob->ji_qs.ji_svrflags&JOB_SVFLG_HERE) { /* only on MS */
     char path[1024]; /* 64 was risky: path_spool + fileprefix + suffix can exceed it */
     char *suf;
     struct stat sbuf;

     (void)strcpy(path, path_spool);
     (void)strcat(path, pjob->ji_qs.ji_fileprefix);
     suf = path+strlen(path);

     (void)strcat(path, JOB_STDOUT_SUFFIX);
     if ( (stat(path, &sbuf)==0) &&
       (sbuf.st_size>>10 > (off_t)VARSPOOLUSERLIM_KB) ){
       sprintf(log_buffer, "stdout file size %luKB exceeded limit %luKB",
         (unsigned long)(sbuf.st_size>>10), (unsigned long)VARSPOOLUSERLIM_KB);
       return (TRUE);
     }

     (void)strcpy(suf, JOB_STDERR_SUFFIX);
     if ( (stat(path, &sbuf)==0) &&
       (sbuf.st_size>>10 > (off_t)VARSPOOLUSERLIM_KB) ){
       sprintf(log_buffer, "stderr file size %luKB exceeded limit %luKB",
         (unsigned long)(sbuf.st_size>>10), (unsigned long)VARSPOOLUSERLIM_KB);
       return (TRUE);
     }
   }
#endif

  -Chad

> Subject: Re: [torqueusers] User's job can mess up the system so that no jobs run
> Date: Fri, 07 Sep 2007 21:44:14 +1000
> From: David Singleton <David.Singleton at anu.edu.au>
> Reply-To: David.Singleton at anu.edu.au
> Organization: ANUSF
> To: Atwood, Robert C <r.atwood at imperial.ac.uk>
> CC: torqueusers at supercluster.org
>
>
> This is a hacky bit of code we have at the end of mom_over_limit()
> in our PBS - it kills jobs when spooled stdout or stderr reach 20MB
> (who will ever read 20MB of text!).  It would need modifying for
> Torque.
>
> David
>
> /* This should be a mom config option */
> #define CHECKVAR
>
> #if !defined(NO_SPOOL_OUTPUT) && defined(CHECKVAR)
> #define VARSPOOLUSERLIM_KB 20480
>
>         /* check file sizes in PBS spool area */
>         if (pjob->ji_qs.ji_svrflags&JOB_SVFLG_HERE) { // only on MS
>                 char path[64];
>                 char *suf;
>                 struct stat sbuf;
>
>                 (void)strcpy(path, path_spool);
>                 (void)strcat(path, pjob->ji_qs.ji_fileprefix);
>                 suf = path+strlen(path);
>
>                 (void)strcat(path, JOB_STDOUT_SUFFIX);
>                 if ( (stat(path, &sbuf)==0) &&
>                      (sbuf.st_size>>10 > (off_t)VARSPOOLUSERLIM_KB) ){
>                         sprintf(log_buffer, "stdout file size %luKB exceeds limit %luKB",
>                                 ((unsigned long)(sbuf.st_size>>10)), (unsigned long)VARSPOOLUSERLIM_KB);
>                         return (JOB_SVFLG_OVERLMT2|JOB_SVFLG_OVERLMTFILE);
>                 }
>
>                 (void)strcpy(suf, JOB_STDERR_SUFFIX);
>                 if ( (stat(path, &sbuf)==0) &&
>                      (sbuf.st_size>>10 > (off_t)VARSPOOLUSERLIM_KB) ){
>                         sprintf(log_buffer, "stderr file size %luKB exceeds limit %luKB",
>                                 ((unsigned long)(sbuf.st_size>>10)), (unsigned long)VARSPOOLUSERLIM_KB);
>                         return (JOB_SVFLG_OVERLMT2|JOB_SVFLG_OVERLMTFILE);
>                 }
>         }
> #endif
>
> Atwood, Robert C wrote:
>>>> Aaron Tygart said:
>>>> Hm, seems as though stdout and stderr for each respective
>>>> job is owned by root.
>>
>>> Rushton Martin said:
>>> On my system the output files are in /var/spool/torque/spool and
>>> are owned by the user.  They move to /var/spool/torque/undelivered
>>
>> My system behaves like Rushton Martin's rather than Aaron Tygart's in
>> this respect, in case the network of quotation was not clear.
>>
>> I received a few suggestions on and off list for mechanisms to recover
>> and prevent this problem in the future, such as external script to test
>> the state etc.
>> Many thanks for the helpful suggestions.
>>
>> I hope it's ok if I forward some of the offlist suggestions to the list
>> -- as future questioners may be searching the list! I hate finding the
>> same question but no answers when I search mailing lists for my
>> problems.
>>
>>  I still think it is a bit of a problem within TORQUE that it is
>> possible in the default setup for a single user to cause all other
>> users' jobs to fail completely silently, hence requiring these external
>> solutions to ensure smooth running.
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
> --
> --------------------------------------------------------------------------
>    Dr David Singleton               ANU Supercomputer Facility
>    HPC Systems Manager              and APAC National Facility
>    David.Singleton at anu.edu.au       Leonard Huxley Bldg (No. 56)
>    Phone: +61 2 6125 4389           Australian National University
>    Fax:   +61 2 6125 8199           Canberra, ACT, 0200, Australia
> --------------------------------------------------------------------------
>


On 12/10/10 2:23 PM, Stijn De Weirdt wrote:
> hi all
>
> is there an easy way to limit the size the stdout and stderr files can
> have on the moms?
>
>
>
> stijn

