[torqueusers] one more trouble report

Alexander Saydakov saydakov at yahoo-inc.com
Thu Apr 27 13:21:36 MDT 2006


We have had one more incident in which pbs_server stopped responding (no core
dump, it just hung). This was a consequence of losing 32 of the 66 nodes in
the cluster. They were idle at the time and were intentionally rebooted by
system administrators for maintenance. In the log file I see the server
marking them down one after another. After 17 such entries pbs_server became
unresponsive, and I had to kill it with kill -9.

We run 2.0.0p8 (with the localhost caching and delete node patches) on
FreeBSD 4.10.

I have noticed that almost every crash of pbs_server had to do with removing
nodes, either by running qmgr -c 'delete node .' or, as in this case, simply
by losing the connection to nodes. Adding nodes has always been fine, and with
no changes to the configuration the server runs forever.
