[torqueusers] job dieing immediately, 0 byte output file being produced
Coyle, James J [ITACD]
jjc at iastate.edu
Tue Feb 23 10:51:07 MST 2010
Sabuj,
My approach is probably similar to others'.
In my prologue and epilogue scripts I build a list of the nodes
that were dedicated to the job. (E.g., for nodes which have 4 processors,
I use an awk script to print every node which occurs 4 times in
${PBS_NODEFILE}.) For each node in this list, I run node_cleanup
(my own script), which offlines the node if any partition that a user can
write into (/var/spool/torque and /tmp) has a high percentage of its disk
space in use. I log that node for later cleanup. The script also cleans up
leftover IPC resources (as listed by ipcs), deletes any leftover user
processes, etc.
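
For anyone who wants a starting point, here is a rough Python sketch of the
two checks described above. It is not my actual awk or node_cleanup code.
The function names, the 4-processor default and the 90% threshold are
assumptions chosen for illustration, and the disk check has to run on the
node being examined. Offlining the node and logging it are left out.

```python
import shutil
from collections import Counter


def dedicated_nodes(nodefile_lines, procs_per_node=4):
    """Return the nodes that appear exactly procs_per_node times.

    A node listed once per processor for all of its processors is
    assumed to be dedicated to the job (hypothetical helper).
    """
    counts = Counter(line.strip() for line in nodefile_lines if line.strip())
    return sorted(node for node, n in counts.items() if n == procs_per_node)


def nearly_full(path, threshold=0.90):
    """Return True if the filesystem holding path is at or above threshold.

    threshold is a fraction of total space; 0.90 is an assumed default.
    """
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Pseudo-filesystems can report zero size; treat them as not full.
        return False
    return usage.used / usage.total >= threshold


def full_partitions(paths=("/var/spool/torque", "/tmp"), threshold=0.90):
    """Return the user-writable paths whose filesystem is nearly full."""
    return [p for p in paths if nearly_full(p, threshold)]
```

In a prologue or epilogue one would read the lines of ${PBS_NODEFILE}, pass
them to dedicated_nodes(), and then run the disk check on each node returned.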
- Jim C.
James Coyle, PhD
High Performance Computing Group
115 Durham Center
Iowa State Univ.
Ames, Iowa 50011 web: http://www.public.iastate.edu/~jjc
-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Sabuj Pattanayek
Sent: Tuesday, February 23, 2010 11:38 AM
To: torqueusers at supercluster.org
Subject: Re: [torqueusers] job dieing immediately, 0 byte output file being produced
> Nothing showing any errors, the drives are not out of space on the
> server or the node.
OK, I take that back: the node was out of disk space in /. A job had
filled up / by writing to /var/spool/torque/spool, i.e. where the .OU
file goes before it's copied to the person's output directory as
specified in the .pbs file.
Any quick ideas on how to protect against this using settings in Torque
or Maui? I saw something about limiting disk usage...
Thanks,
Sabuj