[torqueusers] job dying immediately, 0-byte output file being produced

Coyle, James J [ITACD] jjc at iastate.edu
Tue Feb 23 10:51:07 MST 2010


Sabuj,

   My approach is probably similar to what others do.

   In my prologue and epilogue scripts I create a list of the nodes
that were dedicated to the job. (E.g., I use an awk script to print
every node that occurs 4 times in ${PBS_NODEFILE}, since our nodes have
4 processors.) For each node in that list I run node_cleanup
(my own script), which offlines any node where a partition a user can
write into (/var/spool/torque and /tmp) is nearly full, and logs that
node for later cleanup. The script also cleans up leftover SysV IPC (ipcs)
resources, kills any leftover user processes, etc.
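
   A minimal sketch of that approach is below. The count of 4 processes
per node matches our 4-processor nodes; the 90% threshold, the ssh
fan-out, the /usr/local/sbin path, and the df/ipcs parsing are only
illustrative stand-ins, not the real node_cleanup script.

    # --- epilogue fragment (runs on the mother superior) ---------------
    #!/bin/sh
    jobid=$1
    user=$2      # TORQUE passes the job owner as the 2nd epilogue argument

    # Nodes dedicated to the job appear 4 times in the node file
    # (4-processor nodes); this pipeline stands in for my awk script.
    nodes=$(sort "${PBS_NODEFILE}" | uniq -c | awk '$1 == 4 {print $2}')

    for node in $nodes; do
        ssh "$node" /usr/local/sbin/node_cleanup "$user" &   # path illustrative
    done
    wait

    # --- node_cleanup (runs on each node, illustrative) ----------------
    #!/bin/sh
    user=$1
    me=$(hostname -s)

    # Offline the node if a user-writable partition is nearly full,
    # and log it for later manual cleanup.
    for fs in /var/spool/torque /tmp; do
        pct=$(df -P "$fs" | awk 'NR == 2 {gsub("%", "", $5); print $5}')
        if [ "$pct" -ge 90 ]; then                  # threshold is a guess
            pbsnodes -o "$me"                       # needs operator privileges
            echo "$(date) $me $fs ${pct}% full" >> /var/log/node_cleanup.log
        fi
    done

    # Remove leftover SysV IPC objects owned by the user
    ipcs -s | awk -v u="$user" '$3 == u {print $2}' | xargs -r -n 1 ipcrm -s
    ipcs -m | awk -v u="$user" '$3 == u {print $2}' | xargs -r -n 1 ipcrm -m

    # Kill any processes the user left behind
    pkill -9 -u "$user"

   An offlined node stays out of the scheduling pool until someone clears
the disk and brings it back with pbsnodes -c.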

 - Jim C.

 James Coyle, PhD
 High Performance Computing Group     
 115 Durham Center            
 Iowa State Univ.          
 Ames, Iowa 50011           web: http://www.public.iastate.edu/~jjc


-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Sabuj Pattanayek
Sent: Tuesday, February 23, 2010 11:38 AM
To: torqueusers at supercluster.org
Subject: Re: [torqueusers] job dying immediately, 0-byte output file being produced

> Nothing is showing any errors; the drives are not out of space on the
> server or the node.

OK, I take that back: the node was out of disk space on /. A job had
filled up / by writing to /var/spool/torque/spool, i.e. where the .OU
file goes before it's copied to the person's output directory as
specified in the .pbs file.

Any quick ideas on how to protect against this using settings in torque
or maui? I saw something about limiting disk usage...

Thanks,
Sabuj
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
