[torqueusers] User's job can mess up the system so that no jobs run

David Singleton David.Singleton at anu.edu.au
Fri Sep 7 22:54:53 MDT 2007


Jeroen van den Muyzenberg wrote:
> Having worked with a few batch systems, I find it far better to have a
> separate filesystem for spooled user stdout/err, with quotas applied.
> Better to have one user's job fail, most probably a runaway anyway, than
> *all* work fail because of a filled filesystem.
> 
> This really isn't a torque issue, but site-defined policy.

Yes and no.  We are talking about PBS-specific files here.

One of the problems is that filesystem quotas don't know about PBS jobs,
but the limit needed is really per job.  Consider a 128P Altix running
lots of single-cpu jobs (we know they exist).  If you think 100MB is a
reasonable limit for a job's stdout/stderr, then you need to set quotas at
12GB in case all jobs belong to one user.  But then if you have 120
different users, total stdout/stderr disk use can reach 1.4TB without
quotas kicking in.  Granted this is not at all likely, but it illustrates
the misfit of per-user quotas for a per-job problem.  Another per-job
issue is that a job writing only 1KB of output may get killed by the
quota before the problem 1GB-output job; at least both jobs will belong
to the same user.

We came to the point of having PBS monitor PBS-specific disk use while
managing job scratch areas.  We tried using quotas in various ways for
quite a while but could not sufficiently disambiguate (over)usage by
different jobs.  Once we had PBS tracking this disk usage, checking
PBS-specific stdout/stderr spool usage didn't feel like crossing any
"role divide".  The site-policy bit would be making the checking, and
the maximum spool file size, runtime configuration options.

David

> 
> Jeroen
> 
> On 07/09/07, David Singleton <David.Singleton at anu.edu.au> wrote:
>> This is a hacky bit of code we have at the end of mom_over_limit()
>> in our PBS - it kills jobs when spooled stdout or stderr reach 20MB
>> (who will ever read 20MB of text!).  It would need modifying for
>> Torque.
>>
>> David
>>
>> /* This should be a mom config option */
>> #define CHECKVAR
>>
>> #if !defined(NO_SPOOL_OUTPUT) && defined(CHECKVAR)
>> #define VARSPOOLUSERLIM_KB 20480
>>
>>          /* check file sizes in PBS spool area */
>>          if (pjob->ji_qs.ji_svrflags & JOB_SVFLG_HERE) { /* only on MS */
>>                  char path[MAXPATHLEN + 1];
>>                  char *suf;
>>                  struct stat sbuf;
>>
>>                  (void)snprintf(path, sizeof(path), "%s%s",
>>                                 path_spool, pjob->ji_qs.ji_fileprefix);
>>                  suf = path + strlen(path);
>>
>>                  (void)strcat(path, JOB_STDOUT_SUFFIX);
>>                  if ( (stat(path, &sbuf)==0) &&
>>                       (sbuf.st_size>>10 > (off_t)VARSPOOLUSERLIM_KB) ){
>>                          sprintf(log_buffer, "stdout file size %luKB exceeds limit %luKB",
>>                                  ((unsigned long)(sbuf.st_size>>10)), (unsigned long)VARSPOOLUSERLIM_KB);
>>                          return (JOB_SVFLG_OVERLMT2|JOB_SVFLG_OVERLMTFILE);
>>                  }
>>
>>                  (void)strcpy(suf, JOB_STDERR_SUFFIX);
>>                  if ( (stat(path, &sbuf)==0) &&
>>                       (sbuf.st_size>>10 > (off_t)VARSPOOLUSERLIM_KB) ){
>>                          sprintf(log_buffer, "stderr file size %luKB exceeds limit %luKB",
>>                                  ((unsigned long)(sbuf.st_size>>10)), (unsigned long)VARSPOOLUSERLIM_KB);
>>                          return (JOB_SVFLG_OVERLMT2|JOB_SVFLG_OVERLMTFILE);
>>                  }
>>          }
>> #endif
>>
>> Atwood, Robert C wrote:
>>>>> Aaron Tygart said:
>>>>> Hm, seems as though stdout and stderr for each respective
>>>>> job is owned by root.
>>>> Rushton Martin said:
>>>> On my system the output files are in /var/spool/torque/spool and
>>>> are owned by the user.  They move to /var/spool/torque/undelivered
>>> My system behaves like Rushton Martin's rather than Aaron Tygart's in
>>> this respect, in case the network of quotation was not clear.
>>>
>>> I received a few suggestions on and off list for mechanisms to recover
>>> and prevent this problem in the future, such as external script to test
>>> the state etc.
>>> Many thanks for the helpful suggestions.
>>>
>>> I hope it's ok if I forward some of the offlist suggestions to the list
>>> -- as future questioners may be searching the list! I hate finding the
>>> same question but no answers when I search mailing lists for my
>>> problems.
>>>
>>> I still think it is a bit of a problem within TORQUE that it is
>>> possible, in the default setup, for a single user to cause all other
>>> users' jobs to fail completely silently, hence requiring these
>>> external solutions to ensure smooth running.
>>>


More information about the torqueusers mailing list