[torqueusers] User's job can mess up the system so that no jobs run

Jeroen van den Muyzenberg jeroen.vandenmuyzenberg at gmail.com
Fri Sep 7 09:04:17 MDT 2007


Having worked with a few batch systems, I find it far better to keep
spooled user stdout/stderr on a separate filesystem and to apply quotas.
It is better to have one user's job fail, most probably a runaway anyway,
than to have *all* work fail because of a filled filesystem.

This really isn't a Torque issue, but a matter of site-defined policy.

Jeroen

On 07/09/07, David Singleton <David.Singleton at anu.edu.au> wrote:
>
> This is a hacky bit of code we have at the end of mom_over_limit()
> in our PBS - it kills jobs when spooled stdout or stderr reach 20MB
> (who will ever read 20MB of text!).  It would need modifying for
> Torque.
>
> David
>
> /* This should be a mom config option */
> #define CHECKVAR
>
> #if !defined(NO_SPOOL_OUTPUT) && defined(CHECKVAR)
> #define VARSPOOLUSERLIM_KB 20480
>
>          /* check file sizes in PBS spool area */
>          if (pjob->ji_qs.ji_svrflags & JOB_SVFLG_HERE) { /* only on Mother Superior */
>                  char path[MAXPATHLEN];  /* was 64; spool path plus prefix could overflow */
>                  char *suf;
>                  struct stat sbuf;
>
>                  (void)strcpy(path, path_spool);
>                  (void)strcat(path, pjob->ji_qs.ji_fileprefix);
>                  suf = path + strlen(path);
>
>                  (void)strcat(path, JOB_STDOUT_SUFFIX);
>                  if ( (stat(path, &sbuf)==0) &&
>                       (sbuf.st_size>>10 > (off_t)VARSPOOLUSERLIM_KB) ){
>                          sprintf(log_buffer, "stdout file size %luKB exceeds limit %luKB",
>                                  ((unsigned long)(sbuf.st_size>>10)), (unsigned long)VARSPOOLUSERLIM_KB);
>                          return (JOB_SVFLG_OVERLMT2|JOB_SVFLG_OVERLMTFILE);
>                  }
>
>                  (void)strcpy(suf, JOB_STDERR_SUFFIX);
>                  if ( (stat(path, &sbuf)==0) &&
>                       (sbuf.st_size>>10 > (off_t)VARSPOOLUSERLIM_KB) ){
>                          sprintf(log_buffer, "stderr file size %luKB exceeds limit %luKB",
>                                  ((unsigned long)(sbuf.st_size>>10)), (unsigned long)VARSPOOLUSERLIM_KB);
>                          return (JOB_SVFLG_OVERLMT2|JOB_SVFLG_OVERLMTFILE);
>                  }
>          }
> #endif
>
> Atwood, Robert C wrote:
> >>> Aaron Tygart said:
> >>> Hm, seems as though stdout and stderr for each respective
> >>> job is owned by root.
> >
> >> Rushton Martin said:
> >> On my system the output files are in /var/spool/torque/spool and
> >> are owned by the user.  They move to /var/spool/torque/undelivered
> >
> > My system behaves like Rushton Martin's rather than Aaron Tygart's in
> > this respect, in case the network of quotation was not clear.
> >
> > I received a few suggestions on and off list for mechanisms to recover
> > and prevent this problem in the future, such as external script to test
> > the state etc.
> > Many thanks for the helpful suggestions.
> >
> > I hope it's ok if I forward some of the offlist suggestions to the list
> > -- as future questioners may be searching the list! I hate finding the
> > same question but no answers when I search mailing lists for my
> > problems.
> >
> >  I still think it is a bit of a problem within TORQUE that, in the
> > default setup, a single user can cause all other users' jobs to fail
> > completely silently, hence requiring these external solutions to
> > ensure smooth running.
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
> --
> --------------------------------------------------------------------------
>     Dr David Singleton               ANU Supercomputer Facility
>     HPC Systems Manager              and APAC National Facility
>     David.Singleton at anu.edu.au       Leonard Huxley Bldg (No. 56)
>     Phone: +61 2 6125 4389           Australian National University
>     Fax:   +61 2 6125 8199           Canberra, ACT, 0200, Australia
> --------------------------------------------------------------------------
>

