[torqueusers] share memory full, submit job failed

James J Coyle jjc at iastate.edu
Tue Jan 15 08:11:38 MST 2008


  Here is how I dealt with that situation:

  In the prologue and epilogue scripts in the pbs_mom/priv directory,
I have node_cleanup scripts which are run on each dedicated node.
I detect which nodes are dedicated by counting the occurrences of
each node in the file ${PBS_NODEFILE}.  A simple awk script counts
the occurrences of each node in that file and prints the node name when
the count is 4.  I have ppn=4 for each node, so a count of 4 means the
node is dedicated to a single job.
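
A minimal sketch of the counting step described above. It is not the
original node_cleanup script: the function name dedicated_nodes and the
PPN variable are hypothetical, and it assumes the node file lists one
hostname per allocated processor.

```shell
#!/bin/sh
# Sketch only: list the nodes that appear PPN times in a PBS node file.
# dedicated_nodes and PPN are hypothetical names used for illustration.
PPN=${PPN:-4}

dedicated_nodes() {
    # $1: path to the node file (one hostname per allocated processor)
    awk -v ppn="$PPN" '
        NF  { count[$1]++ }                 # tally each hostname
        END { for (n in count)              # order of output is unspecified
                  if (count[n] == ppn) print n }
    ' "$1"
}

# Assumption: PBS_NODEFILE is set and readable where this runs.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    dedicated_nodes "$PBS_NODEFILE"
fi
```

A node that is shared between jobs appears fewer than PPN times in any
one job's node file, so it is not printed and is left alone.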

  One of the items in the node cleanup file is the script
node_ipc_cleanup, which I append.

  This script can be invoked either
1) with a single argument (a non-root username),
   which means delete only the ipc resources associated with that username, or
2) with no arguments,
   which means delete all non-root ipc resources.

The comments in the script should explain what it is doing.  Some debug 
statements have been left in but commented out.

  I have also had to remove some old /dev/shm  files in a similar manner.
(When a node is dedicated to a single user, any /dev/shm files belonging to
another non-root user can be removed.)
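
A minimal sketch of that kind of /dev/shm cleanup, under stated
assumptions. It is not the script used on the cluster: the function name
stale_shm_files and its arguments are hypothetical, and it defaults to a
dry run that only lists what it would remove.

```shell
#!/bin/sh
# Sketch only: list (or delete) regular files in a shared-memory directory
# that belong to neither root nor the user the node is dedicated to.
# stale_shm_files and its arguments are hypothetical names.
stale_shm_files() {
    keep_user=$1              # job owner whose files must be kept
    shm_dir=${2:-/dev/shm}    # directory to scan
    action=${3:-list}         # "list" (dry run) or "delete"

    if [ -z "$keep_user" ]; then
        echo "usage: stale_shm_files USER [DIR] [list|delete]" >&2
        return 2
    fi

    if [ "$action" = delete ]; then
        find "$shm_dir" -xdev -type f ! -user root ! -user "$keep_user" \
            -exec rm -f {} +
    else
        find "$shm_dir" -xdev -type f ! -user root ! -user "$keep_user" -print
    fi
}
```

Running it first in "list" mode, for example
stale_shm_files someuser /dev/shm list, shows what would be removed
before anything is deleted.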

I am appending the node_ipc_cleanup script that I use.

 James Coyle, PhD
 SGI Origin, Alpha, Xeon and Opteron Cluster Manager
 High Performance Computing Group     
 235 Durham Center            
 Iowa State Univ.           phone: (515)-294-2099
 Ames, Iowa 50011           web: http://jjc.public.iastate.edu


# In "ipcs" output, resource lines start with the key (0x...),
# column 2 is the id and column 3 is the owner.

if [ $# -eq 1 ] ; then
# If user known, just delete that user's
  Q=`ipcs -q | awk -v u="$1" '/^0x/ && $3 == u && $3 != "root" {print $2}'`
  M=`ipcs -m | awk -v u="$1" '/^0x/ && $3 == u && $3 != "root" {print $2}'`
  S=`ipcs -s | awk -v u="$1" '/^0x/ && $3 == u && $3 != "root" {print $2}'`
else
# Else delete all non-root IPC resources
  Q=`ipcs -q | awk '/^0x/ && $3 != "root" {print $2}'`
  M=`ipcs -m | awk '/^0x/ && $3 != "root" {print $2}'`
  S=`ipcs -s | awk '/^0x/ && $3 != "root" {print $2}'`
fi

# Remove message queues
if [ -n "$Q" ] ; then
#  echo "will delete message queues $Q"
  for q in $Q ; do
#   echo "ipcrm -q $q"
   ipcrm -q $q
  done
fi

# Remove shared memory segments
if [ -n "$M" ] ; then
#  echo "will delete memory segments $M"
  for m in $M ; do
#   echo "ipcrm -m $m"
   ipcrm -m $m
  done
fi

# Remove semaphores
if [ -n "$S" ] ; then
#  echo "will delete semaphores $S"
  for s in $S ; do
#   echo "ipcrm -s $s"
   ipcrm -s $s
  done
fi

> Hi
>     In my cluster, a user runs an MPI program, but when his job finishes the system shared memory is full. If I do not use the ipcrm command to delete the shared memory, jobs submitted through qsub fail. Can anyone suggest a good solution for this shared memory problem? Thanks.
> yang