[torqueusers] Adding a new node requires restart of pbs_server (bug)

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Fri Sep 30 04:41:33 MDT 2005


I'm seeing parallel jobs refusing to start correctly when the MOM
superior of the job runs on a node which has just been added to
the cluster.  Another node's MOM in the sisterhood logs this:

09/30/2005 11:35:32;0001;   pbs_mom;Svr;pbs_mom;im_request, bad connect from
10.1.129.7:1023 - unauthorized (okclients: 
10.1.129.139,10.1.129.138,10.1.129.137,10.1.129.136,10.1.129.135,
10.1.129.134,10.1.129.133,10.1.129.132,10.1.129.131,
10.1.129.130,10.1.129.129,10.1.129.128,10.1.129.127,10.1.129.126,10.1.129.125,
10.1.129.124,10.1.129.123,10.1.129.122,10.1.129.121,10.1.129.120,10.1.129.119,
10.1.129.118,10.1.129.117,10.1.129.116,10.1.129.115,10.1.129.114,10.1.129.113,
10.1.129.112,10.1.129.111,10.1.129.110,10.1.129.109,10.1.129.108,10.1.129.107,
10.1.129.106,10.1.129.105,10.1.129.104,10.1.129.103,10.1.129.102,10.1.129.101,
10.1.129.100,10.1.129.159,10.1.129.219,10.1.130.19,10.1.130.202,10.1.128.2,
10.1.130.218,127.0.0.1)

The node 10.1.129.7 was just added to the cluster, and is being
refused by other MOMs.  I googled for this error message and found
a workaround here:
http://www.supercluster.org/pipermail/torqueusers/2004-September/000746.html
You have to shut down pbs_server (and Maui) and restart them.
This solves the problem.
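
For anyone who wants to confirm the same symptom in their own MOM logs, here is a minimal Python sketch that pulls the refused address and the okclients list out of an entry like the one quoted above and checks whether the address is missing.  It is not part of Torque; the names parse_bad_connect and SAMPLE are my own, the sample entry is an abbreviated copy of the log above, and the regular expression assumes the message always looks like that one entry.

```python
import re

# Abbreviated copy of the log entry quoted above (most okclients omitted).
SAMPLE = """09/30/2005 11:35:32;0001;   pbs_mom;Svr;pbs_mom;im_request, bad connect from
10.1.129.7:1023 - unauthorized (okclients:
10.1.129.139,10.1.129.138,10.1.129.137,
10.1.130.218,127.0.0.1)"""

# Assumed message layout, inferred only from the single entry above.
PATTERN = re.compile(
    r"bad connect from (\d+(?:\.\d+){3}):\d+.*?\(okclients:([^)]*)\)"
)


def parse_bad_connect(entry):
    """Return (refused_ip, set_of_okclients), or None if the entry does not match."""
    # The entry may be wrapped over several lines, so collapse all whitespace.
    text = " ".join(entry.split())
    match = PATTERN.search(text)
    if match is None:
        return None
    refused = match.group(1)
    okclients = {ip.strip() for ip in match.group(2).split(",") if ip.strip()}
    return refused, okclients


if __name__ == "__main__":
    refused, okclients = parse_bad_connect(SAMPLE)
    if refused in okclients:
        print(refused, "is in okclients; the refusal has some other cause")
    else:
        print(refused, "is missing from okclients on this MOM")
```

If the refused address is missing from okclients, as it is for 10.1.129.7 here, you are most likely looking at the same problem.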

So this is a real bug in Torque, and not due to an unclean shutdown
of a node's pbs_mom.  If you add a new node to the cluster, it seems
that you need to restart pbs_server.  Not very elegant :-(

Actually, I just now found this bug in the Torque Bugzilla at
    http://www.clusterresources.com/bugzilla/show_bug.cgi?id=91
so I can add that it's reproduced at other sites as well.
Restarting the pbs_mom on nodes is of course an unacceptable
workaround in a production environment, but the pbs_server
restart seems to do the trick.

I'm running Torque 1.2.0p6 (as distributed) on CentOS 4.1 Linux
(a RHEL 4.0 clone).

-- 
Ole Holm Nielsen
Department of Physics, Technical University of Denmark

