[torqueusers] Re: moms clearing their own offline status
Garrick Staples
garrick at usc.edu
Fri Oct 29 14:31:16 MDT 2004
On Fri, Oct 29, 2004 at 12:38:10PM -0700, Garrick Staples alleged:
> Torque/Maui is getting so good at solving all of the bigger issues, I'm
> starting to drill down into the smaller annoying ones :)
>
> This has been bugging me for a long time now, but I've only now figured out
> how to reproduce it. I've always noticed that sometimes when I boot a node
> that was marked offline, it will have the status cleared when pbs_mom starts.
>
> Today I found that I can reproduce it 100% of the time. It only happens when
> pbs_mom wasn't shut down cleanly or pbs_server was unreachable when it was
> shut down. You can either bring down networking, crash the machine, or
> kill -9 pbs_mom, and the mom will always be online again when it starts up.
I think I found it. This code kicks in when a mom starts up but the server
still has a valid connection entry for it. Instead of just setting the state
to unknown, it should preserve the offline state.
diff -ruN torque-1.1.0p4_orig/src/server/node_manager.c torque-1.1.0p4/src/server/node_manager.c
--- torque-1.1.0p4_orig/src/server/node_manager.c 2004-10-28 15:50:48.000000000 -0700
+++ torque-1.1.0p4/src/server/node_manager.c 2004-10-29 13:28:06.000000000 -0700
@@ -873,7 +873,14 @@
   tdelete((u_long)node->nd_stream,&streams);
-  node->nd_state = INUSE_UNKNOWN;
+  if (node->nd_state & INUSE_OFFLINE)
+    {
+    node->nd_state = (INUSE_UNKNOWN|INUSE_OFFLINE);
+    }
+  else
+    {
+    node->nd_state = INUSE_UNKNOWN;
+    }
   node->nd_stream = -1;
   /* do a ping in 5 seconds */
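
For anyone who wants to poke at the logic outside of pbs_server, here is a
minimal standalone sketch of what the patch does to the state bits. It is not
the actual TORQUE code: the flag values and the function names
state_after_stream_reset() and check_state_after_stream_reset() are made up
for illustration, and the real INUSE_* definitions in the TORQUE headers may
differ. The if/else in the patch should be equivalent to keeping only the
offline bit from the old state and OR-ing in unknown.

```c
#include <assert.h>

/* Hypothetical flag values for illustration only; the real INUSE_*
 * definitions live in the TORQUE headers and may differ. */
#define INUSE_OFFLINE 0x01
#define INUSE_DOWN    0x02
#define INUSE_UNKNOWN 0x100

/* State a node gets when its stale stream entry is dropped: always
 * unknown, plus the offline bit if it was set before. Every other bit
 * of the old state is discarded, as in the patch. */
static int state_after_stream_reset(int old_state)
{
    return INUSE_UNKNOWN | (old_state & INUSE_OFFLINE);
}

/* Small self-check covering both branches of the patch. */
static void check_state_after_stream_reset(void)
{
    /* offline node stays offline */
    assert(state_after_stream_reset(INUSE_OFFLINE) ==
           (INUSE_UNKNOWN | INUSE_OFFLINE));
    /* node that was not offline just becomes unknown */
    assert(state_after_stream_reset(0) == INUSE_UNKNOWN);
}
```

Calling check_state_after_stream_reset() from a small main() exercises both
cases. The single-mask form is only a compact restatement; the explicit
if/else in the patch is arguably easier to read in context.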
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California