[torqueusers] Could someone please comment on this?

Jan Ploski Jan.Ploski at offis.de
Thu May 22 08:46:54 MDT 2008


torqueusers-bounces at supercluster.org wrote on 05/21/2008 08:42:31 PM:

> Hello,
> 
> I am really having trouble with the qsub error:
> 
> rm_18238:  p4_error: semget failed for setnum: 0
> p0_3227: (1.105469) net_recv failed for fd = 15
> p0_3227:  p4_error: net_recv read, errno = : 104
> p0_3227: (139.230469) net_send: could not write to fd=4, errno = 32
> 
> 
> I do not know how to begin tracing this error.  I looked at the nodes 
> file named by the PBS_NODEFILE variable and thought I had found the 
> offending node, so I reinstalled Rocks on it, but nothing changed.

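I don't think this is a qsub issue. It looks more like a problem with 
System V IPC cleanup in MPICH: semget is the call that allocates a 
semaphore set, and it can fail when semaphores left behind by earlier 
jobs have used up the system limit. Search results suggest running 
'ipcs' as root on the affected node to check for semaphore arrays or 
shared memory segments that were never released. If there are any, 
remove them manually with 'cleanipcs' or 'ipcrm'.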
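
As a rough sketch of that cleanup, the Python script below lists the 
'ipcrm' commands that would remove one user's leftover semaphore arrays 
and shared memory segments. It assumes the Linux (util-linux) 'ipcs' 
table layout, where each data row starts with a hex key followed by the 
ID and the owner. The helper names parse_ipcs and stale_ipc_commands are 
made up for this illustration. The script only prints the commands and 
does not run them.

```python
import subprocess


def parse_ipcs(output, owner):
    """Return the IDs of rows in `ipcs -s` / `ipcs -m` output owned by `owner`.

    Assumes the util-linux column order: key, id, owner, ...
    """
    ids = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows start with a hex key; header and separator rows do not.
        if len(fields) >= 3 and fields[0].startswith("0x") and fields[2] == owner:
            ids.append(int(fields[1]))
    return ids


def stale_ipc_commands(owner):
    """Build (but do not run) ipcrm commands for the owner's IPC resources."""
    commands = []
    # -s selects semaphore arrays, -m selects shared memory segments;
    # ipcs and ipcrm use the same letters for both resource types.
    for flag in ("-s", "-m"):
        result = subprocess.run(
            ["ipcs", flag], capture_output=True, text=True, check=True
        )
        for ipc_id in parse_ipcs(result.stdout, owner):
            commands.append(["ipcrm", flag, str(ipc_id)])
    return commands


if __name__ == "__main__":
    import getpass

    # Print the commands for review instead of executing them.
    for cmd in stale_ipc_commands(getpass.getuser()):
        print(" ".join(cmd))
```

Review the printed commands before running them, and only remove 
resources when no MPI job belonging to that user is still running on the 
node, since live jobs hold the same kind of segments.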

Regards,
Jan Ploski

