[torqueusers] User's job can mess up the system so that no jobs run
David Singleton
David.Singleton at anu.edu.au
Fri Sep 7 05:44:14 MDT 2007
This is a hacky bit of code we have at the end of mom_over_limit()
in our PBS. It kills jobs when the spooled stdout or stderr file
exceeds 20MB (who will ever read 20MB of text?). It would need
modifying for Torque.
David
/* This should be a mom config option */
#define CHECKVAR
#if !defined(NO_SPOOL_OUTPUT) && defined(CHECKVAR)
#define VARSPOOLUSERLIM_KB 20480   /* 20MB, expressed in KB */

    /* check file sizes in PBS spool area */
    if (pjob->ji_qs.ji_svrflags & JOB_SVFLG_HERE) {   /* only on Mother Superior */
        char         path[64];   /* assumes spool dir + prefix + suffix fit in 64 bytes */
        char        *suf;
        struct stat  sbuf;

        /* build <path_spool><fileprefix>, remembering where the suffix goes */
        (void)strcpy(path, path_spool);
        (void)strcat(path, pjob->ji_qs.ji_fileprefix);
        suf = path + strlen(path);

        /* spooled stdout */
        (void)strcat(path, JOB_STDOUT_SUFFIX);
        if ((stat(path, &sbuf) == 0) &&
            (sbuf.st_size >> 10 > (off_t)VARSPOOLUSERLIM_KB)) {
            sprintf(log_buffer, "stdout file size %luKB exceeds limit %luKB",
                    (unsigned long)(sbuf.st_size >> 10),
                    (unsigned long)VARSPOOLUSERLIM_KB);
            return (JOB_SVFLG_OVERLMT2 | JOB_SVFLG_OVERLMTFILE);
        }

        /* spooled stderr: overwrite the suffix in place */
        (void)strcpy(suf, JOB_STDERR_SUFFIX);
        if ((stat(path, &sbuf) == 0) &&
            (sbuf.st_size >> 10 > (off_t)VARSPOOLUSERLIM_KB)) {
            sprintf(log_buffer, "stderr file size %luKB exceeds limit %luKB",
                    (unsigned long)(sbuf.st_size >> 10),
                    (unsigned long)VARSPOOLUSERLIM_KB);
            return (JOB_SVFLG_OVERLMT2 | JOB_SVFLG_OVERLMTFILE);
        }
    }
#endif
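
For anyone who wants to try the same size test outside of pbs_mom, here
is a minimal standalone sketch of it. The names build_spool_path(),
spool_file_over_limit() and SPOOL_LIMIT_KB are illustrative only and are
not part of PBS or Torque. The spool directory, file prefix and suffix
are passed in by the caller, since they depend on the local installation.
Unlike the fragment above, it uses snprintf() and reports a path that
does not fit instead of assuming a fixed-size buffer is big enough.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical limit, mirroring VARSPOOLUSERLIM_KB above (20MB in KB). */
#define SPOOL_LIMIT_KB 20480

/*
 * Build "<spool_dir><prefix><suffix>" into buf.
 * Returns 0 on success, -1 if the result would not fit in buflen bytes.
 */
int build_spool_path(char *buf, size_t buflen, const char *spool_dir,
                     const char *prefix, const char *suffix)
{
    int n = snprintf(buf, buflen, "%s%s%s", spool_dir, prefix, suffix);

    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}

/*
 * Returns 1 if the file at path is larger than limit_kb kilobytes,
 * 0 if it is within the limit, and -1 if stat() fails
 * (for example because the file does not exist).
 */
int spool_file_over_limit(const char *path, off_t limit_kb)
{
    struct stat sbuf;

    if (stat(path, &sbuf) != 0)
        return -1;

    /* same comparison as the mom fragment: size in whole KB vs. limit */
    return ((sbuf.st_size >> 10) > limit_kb) ? 1 : 0;
}
```

A caller would build the stdout and stderr paths for a job with
build_spool_path() and treat a return value of 1 from
spool_file_over_limit(path, SPOOL_LIMIT_KB) as "over the limit". What to
do then (log it, delete the job, mail the user) is left to the site.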
Atwood, Robert C wrote:
>>> Aaron Tygart said:
>>> Hm, seems as though stdout and stderr for each respective
>>> job is owned by root.
>
>> Rushton Martin said:
>> On my system the output files are in /var/spool/torque/spool and
>> are owned by the user. They move to /var/spool/torque/undelivered
>
> My system behaves like Rushton Martin's rather than Aaron Tygart's in
> this respect, in case the network of quotation was not clear.
>
> I received a few suggestions on and off list for mechanisms to recover
> from this problem and prevent it in the future, such as an external
> script to test the state.
> Many thanks for the helpful suggestions.
>
> I hope it's ok if I forward some of the offlist suggestions to the list
> -- as future questioners may be searching the list! I hate finding the
> same question but no answers when I search mailing lists for my
> problems.
>
> I still think it is a bit of a problem within TORQUE that it is
> possible in the default setup for a single user to cause all other users'
> jobs to fail completely silently, hence requiring these external
> solutions to ensure smooth running.
--
--------------------------------------------------------------------------
Dr David Singleton ANU Supercomputer Facility
HPC Systems Manager and APAC National Facility
David.Singleton at anu.edu.au Leonard Huxley Bldg (No. 56)
Phone: +61 2 6125 4389 Australian National University
Fax: +61 2 6125 8199 Canberra, ACT, 0200, Australia
--------------------------------------------------------------------------