[torqueusers] share memory full, submit job failed
James J Coyle
jjc at iastate.edu
Tue Jan 15 08:11:38 MST 2008
Yang,
Here is how I dealt with that situation:
In the prologue and epilogue scripts in the pbs_mom/priv directory,
I run node_cleanup scripts on each node that is dedicated to a single job.
I detect dedicated nodes by counting the occurrences of each node in the
file ${PBS_NODEFILE}. A simple awk script prints a node when its count
is 4. I have ppn=4 for each node, so a count of 4 means the node is
dedicated to a single job.
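
The counting step could be sketched roughly as follows. This is not the actual script from the post; the helper name dedicated_nodes is made up for illustration, and it assumes the node file has one hostname per line, repeated once per allocated processor.

```shell
#!/bin/sh
# Sketch only: print each node that appears exactly PPN times in a node file.
# dedicated_nodes is a hypothetical helper name; PPN defaults to 4 (ppn=4).
dedicated_nodes() {
    nodefile=$1
    ppn=${2:-4}
    awk -v ppn="$ppn" '
        { count[$1]++ }                     # tally each hostname
        END {
            for (node in count)
                if (count[node] == ppn)     # all slots held by this job
                    print node
        }
    ' "$nodefile"
}

# Run against the job's node file only when one is available.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "${PBS_NODEFILE}" ]; then
    dedicated_nodes "${PBS_NODEFILE}" 4
fi
```

The order in which awk prints the nodes is unspecified, so pipe the output through sort if a stable order matters.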
One of the items in the node_cleanup file is the script
node_ipc_cleanup
which I append.
This script can be invoked either
1) with a single argument (a non-root username)
which means delete only the ipc resources associated with that username, or
2) with no arguments,
which means delete all non-root ipc resources.
The comments in the script should explain what it is doing. Some debug
statements have been left in but commented out.
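
Before enabling the deletions, it can help to preview which ids would be selected. The filter below is a minimal sketch of that idea, not part of the original script. The name ipc_ids is hypothetical, and it assumes the Linux ipcs layout in which resource lines start with a 0x key, field 2 is the id and field 3 is the owner.

```shell
#!/bin/sh
# Sketch only: read ipcs-style output on stdin and print the ids of
# non-root resources. With one argument, print only ids owned by that user.
# ipc_ids is a hypothetical helper name; it prints ids and removes nothing.
ipc_ids() {
    owner=${1:-}
    awk -v owner="$owner" '
        /^0x/ && $3 != "root" && (owner == "" || $3 == owner) { print $2 }
    '
}
```

For example, `ipcs -m | ipc_ids someuser` would list that user's shared memory segment ids, and `ipcs -s | ipc_ids` would list all non-root semaphore ids. The owner is compared exactly, so a user whose name is a prefix of another user's name does not pick up the other user's resources.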
I have also had to remove some old /dev/shm files in a similar manner.
(When a node is dedicated to a single user, any /dev/shm files belonging to
another non-root user can be removed.)
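
A /dev/shm check along those lines might look like the sketch below. It is an assumption about how this could be done, not the script actually in use. The helper name stale_shm_files is hypothetical, and it only lists candidate files rather than deleting them.

```shell
#!/bin/sh
# Sketch only: list regular files under a directory (default /dev/shm) that
# belong neither to root nor to the user the node is dedicated to.
# stale_shm_files is a hypothetical helper name; nothing is deleted here.
# $1 = user to keep (name or numeric uid), $2 = directory to scan (optional)
stale_shm_files() {
    keep_user=$1
    dir=${2:-/dev/shm}
    find "$dir" -type f ! -user root ! -user "$keep_user" -print
}
```

Once the listing looks right on a dedicated node, the files could be removed by replacing -print with -exec rm -f {} + in a copy of the function.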
I am appending the node_ipc_cleanup script that I use.
--
James Coyle, PhD
SGI Origin, Alpha, Xeon and Opteron Cluster Manager
High Performance Computing Group
235 Durham Center
Iowa State Univ. phone: (515)-294-2099
Ames, Iowa 50011 web: http://jjc.public.iastate.edu
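#!/bin/ksh
# node_ipc_cleanup: remove System V IPC resources left behind on a node.
#   node_ipc_cleanup USER   delete only the resources owned by USER
#   node_ipc_cleanup        delete all non-root resources
# In the ipcs output, resource lines start with the key (0x...),
# field 2 is the resource id and field 3 is the owner.
if [ $# -eq 1 ] ; then
    # If user known, just delete that user's resources.
    # The owner field is compared exactly, so "bob" does not match "bobby".
    Q=`ipcs -q | awk -v u="$1" '/^0x/ && $3 == u && $3 != "root" {print $2}'`
    M=`ipcs -m | awk -v u="$1" '/^0x/ && $3 == u && $3 != "root" {print $2}'`
    S=`ipcs -s | awk -v u="$1" '/^0x/ && $3 == u && $3 != "root" {print $2}'`
else
    # Else delete all non-root IPC resources
    Q=`ipcs -q | awk '/^0x/ && $3 != "root" {print $2}'`
    M=`ipcs -m | awk '/^0x/ && $3 != "root" {print $2}'`
    S=`ipcs -s | awk '/^0x/ && $3 != "root" {print $2}'`
fi

# Message queues
if [ -n "$Q" ] ; then
    # echo "will delete message queues $Q"
    for q in $Q ; do
        # echo "ipcrm -q $q"
        ipcrm -q "$q"
    done
fi

# Shared memory segments
if [ -n "$M" ] ; then
    # echo "will delete memory segments $M"
    for m in $M ; do
        # echo "ipcrm -m $m"
        ipcrm -m "$m"
    done
fi

# Semaphores
if [ -n "$S" ] ; then
    # echo "will delete semaphores $S"
    for s in $S ; do
        # echo "ipcrm -s $s"
        ipcrm -s "$s"
    done
fi
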
> Hi
>
> In my cluster, a user runs an MPI program, but when his job finishes, the system shared memory is full. If I do not delete the shared memory with the ipcrm command, jobs submitted through qsub fail. Can anyone suggest a good solution for this shared memory problem? Thanks.
>
>
> yang