[torqueusers] User's job can mess up the system so that no jobs run

David Singleton David.Singleton at anu.edu.au
Fri Sep 7 05:44:14 MDT 2007


This is a hacky bit of code we have at the end of mom_over_limit()
in our PBS - it kills jobs when spooled stdout or stderr reaches 20MB
(who will ever read 20MB of text!).  It would need modifying for
Torque.

David

/* This should be a mom config option */
#define CHECKVAR

#if !defined(NO_SPOOL_OUTPUT) && defined(CHECKVAR)
#define VARSPOOLUSERLIM_KB 20480

         /* check file sizes in PBS spool area; only on Mother Superior */
         if (pjob->ji_qs.ji_svrflags & JOB_SVFLG_HERE) {
                 char path[MAXPATHLEN + 1];
                 char *suf;
                 struct stat sbuf;

                 /* Build "<spooldir>/<jobid>" once; suf marks the point
                  * where the ".OU"/".ER" suffix is written. */
                 (void)strcpy(path, path_spool);
                 (void)strcat(path, pjob->ji_qs.ji_fileprefix);
                 suf = path + strlen(path);

                 (void)strcpy(suf, JOB_STDOUT_SUFFIX);
                 if (stat(path, &sbuf) == 0 &&
                     (sbuf.st_size >> 10) > (off_t)VARSPOOLUSERLIM_KB) {
                         sprintf(log_buffer,
                                 "stdout file size %luKB exceeds limit %luKB",
                                 (unsigned long)(sbuf.st_size >> 10),
                                 (unsigned long)VARSPOOLUSERLIM_KB);
                         return (JOB_SVFLG_OVERLMT2 | JOB_SVFLG_OVERLMTFILE);
                 }

                 (void)strcpy(suf, JOB_STDERR_SUFFIX);
                 if (stat(path, &sbuf) == 0 &&
                     (sbuf.st_size >> 10) > (off_t)VARSPOOLUSERLIM_KB) {
                         sprintf(log_buffer,
                                 "stderr file size %luKB exceeds limit %luKB",
                                 (unsigned long)(sbuf.st_size >> 10),
                                 (unsigned long)VARSPOOLUSERLIM_KB);
                         return (JOB_SVFLG_OVERLMT2 | JOB_SVFLG_OVERLMTFILE);
                 }
         }
#endif

Atwood, Robert C wrote:
>>> Aaron Tygart said:
>>> Hm, seems as though stdout and stderr for each respective 
>>> job is owned by root.
> 
>> Rushton Martin said:
>> On my system the output files are in /var/spool/torque/spool and
>> are owned by the user.  They move to /var/spool/torque/undelivered
> 
> My system behaves like Rushton Martin's rather than Aaron Tygart's in
> this respect, in case the chain of quotation was not clear. 
> 
> I received a few suggestions on and off list for mechanisms to recover
> from this problem and prevent it in the future, such as an external
> script to test the state, etc.
> Many thanks for the helpful suggestions.
> 
> I hope it's ok if I forward some of the offlist suggestions to the list
> -- as future questioners may be searching the list! I hate finding the
> same question but no answers when I search mailing lists for my
> problems.
> 
>  I still think it is a bit of a problem within TORQUE that, in the
> default setup, a single user can cause all other users' jobs to fail
> completely silently, hence requiring these external solutions to
> ensure smooth running.  
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


-- 
--------------------------------------------------------------------------
    Dr David Singleton               ANU Supercomputer Facility
    HPC Systems Manager              and APAC National Facility
    David.Singleton at anu.edu.au       Leonard Huxley Bldg (No. 56)
    Phone: +61 2 6125 4389           Australian National University
    Fax:   +61 2 6125 8199           Canberra, ACT, 0200, Australia
--------------------------------------------------------------------------
