[torqueusers] qsub issue - please give me a troubleshooting path
Brock Palen
brockp at umich.edu
Thu May 22 22:17:45 MDT 2008
This happens all the time with shared-memory-enabled mpich (drives me
nuts). One option is to move to an MPI library that cleans up nicely when
an app dies (we use OpenMPI, but many others work).
The other option is to disable shared memory. It's off by default, so you
must have enabled it when you configured mpich.
Last, to fix the immediate problem you need to run 'cleanipcs' on all the
nodes, either as the user who owns the leftover semaphore sets, or as root,
which will whack them all. Running it as root will kill any running job
that is using any of those segments, though.
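
If you want to see what is left on a node before removing anything, here
is a rough sketch (not part of mpich, and not what 'cleanipcs' itself
does internally) that reads the output of 'ipcs -s' and prints the
matching 'ipcrm -s' commands for one user instead of running them. It
assumes the Linux util-linux column layout (key, semid, owner, ...); the
function names, the sample listing, and the user name in it are made up
for illustration.

```python
import subprocess

# Illustrative listing in the util-linux `ipcs -s` layout (not real output).
SAMPLE = """
------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x00000000 98304      jnorris    600        10
0x00000000 131073     jnorris    600        10
0x00000000 163842     root       600        1
"""


def parse_semaphores(ipcs_output):
    """Return (semid, owner) pairs found in the text printed by `ipcs -s`."""
    entries = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows start with a hex key; header and blank lines do not.
        if len(fields) >= 3 and fields[0].startswith("0x") and fields[1].isdigit():
            entries.append((int(fields[1]), fields[2]))
    return entries


def removal_commands(ipcs_output, owner):
    """Build (but do not run) one `ipcrm -s <semid>` command per set owned by `owner`."""
    return ["ipcrm -s %d" % semid
            for semid, who in parse_semaphores(ipcs_output)
            if who == owner]


def current_semaphores():
    """Run `ipcs -s` on this node and return its text output."""
    result = subprocess.run(["ipcs", "-s"], capture_output=True,
                            text=True, check=True)
    return result.stdout
```

On a node you would call removal_commands(current_semaphores(), "someuser"),
look over the printed commands, and run them only once you are sure no
live job of that user still holds those semaphores.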
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985
On May 21, 2008, at 12:48 PM, Joseph Norris wrote:
> Hello to all,
>
> Here is what my users are getting on a qsub of a simple hello run.
>
>
> This is in the output file:
>
> rm_16716: p4_error: semget failed for setnum: 18
> p0_19605: (59.144531) net_send: could not write to fd=4, errno = 32
>
> In the error file the user gets a series of:
>
> Killed by signal 2.
> Killed by signal 2.
>
> Can anyone give me a troubleshoot path for this?
>
> thanks.
>
> --
> Joseph Norris
> Application Developer & Server Administrator
> 209-228-4576
> jnorris at ucmerced.edu
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>