[torqueusers] qsub issue - please give me a troubleshooting path

Brock Palen brockp at umich.edu
Thu May 22 22:17:45 MDT 2008


This happens all the time with shared-memory-enabled MPICH (drives me
nuts). One option is to move to an MPI library that cleans up nicely when
an app dies (we use OpenMPI, but many others work).

The other option is to disable shared memory. It's off by default, so you
must have enabled it when you configured MPICH.

Lastly, to fix the immediate problem you need to run 'cleanipcs' on all
the nodes to clear the stale semaphores left behind by jobs that died.
Run it either as the user who owns the semaphores, or as root, which will
whack them all. Be aware that this will kill any running job that is
using any of those semaphores.
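
If you want to see what is left over before removing anything, the same
cleanup can be sketched by hand with ipcs and ipcrm. This is only a rough
sketch, not the cleanipcs script itself: the function names are made up
for illustration, and it assumes the Linux 'ipcs -s' column layout (key,
semid, owner, perms, nsems).

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, assuming Linux `ipcs -s` columns
# in the order: key semid owner perms nsems.

sem_ids_for_user() {
    # $1 = user name. Reads `ipcs -s` output on stdin and prints the
    # semid of every semaphore array owned by that user, one per line.
    # Header and separator lines are skipped because their second
    # field is not a number.
    awk -v user="$1" '$3 == user && $2 ~ /^[0-9]+$/ { print $2 }'
}

remove_sems_for_user() {
    # $1 = user name. Removes every semaphore array that user owns on
    # this node. Any job still using them will be killed.
    ipcs -s | sem_ids_for_user "$1" | while read -r id; do
        ipcrm -s "$id"
    done
}
```

Run 'ipcs -s | sem_ids_for_user someuser' first to check what would be
removed, then 'remove_sems_for_user someuser' on each node. Root can
remove arrays owned by any user; a normal user can only remove their own.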

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985



On May 21, 2008, at 12:48 PM, Joseph Norris wrote:
> Hello to all,
>
> Here is what my users are getting on a qsub of simple hello run.
>
>
> This is in the output file:
>
> rm_16716:  p4_error: semget failed for setnum: 18
> p0_19605: (59.144531) net_send: could not write to fd=4, errno = 32
>
> the error file the user gets a series of:
>
> Killed by signal 2.
> Killed by signal 2.
>
> Can anyone give me a troubleshoot path for this?
>
> thanks.
>
> -- 
> Joseph Norris
> Application Developer & Server Adminstrator
> 209-228-4576
> jnorris at ucmerced.edu
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>


