[torquedev] Torque 4.0 at startup

David Beer dbeer at adaptivecomputing.com
Fri Jul 13 10:22:18 MDT 2012


Craig,

That is most certainly a bug. We will see if we can reproduce it, but that
definitely shouldn't be happening. Nodes should be marked down until they
report as up.

David

On Fri, Jul 13, 2012 at 12:19 AM, Craig West <cwest at vpac.org> wrote:

>
> Hi All,
>
> This is an observation on Torque 4.0.2+ (I'm not sure about other 4.x
> versions). I'm not sure whether what I am seeing is a bug or a deliberate
> change in the way node state is monitored... I'd like to hear from the
> developers about what the plan is/was and what they were trying to
> achieve (if it is not just a bug).
>
> Setting up the scenario first. I have a test cluster with 5 nodes. 3 of
> the nodes are online, one is powered on (but not running pbs_mom) and
> the last is not even powered on.
>
> It was a bit of a surprise to me that when I started pbs_server recently,
> all 5 nodes were reporting "free"... something that isn't even possible
> (only 3 should have been free). After a few different tests I figured out
> that the default startup state of nodes is now "free". I don't recall
> exactly what the original startup state was, but I don't think it was
> "free"... it might have been "down".
>
> It seems this is related to the status recorded in the
> server_priv/node_status file. If I take the node "offline" (as opposed
> to it just being "down" because pbs_mom is not running), the entry for
> the node in the node_status file vanishes... restarting pbs_server has
> the node start with an offline state. All the nodes that are online have
> a "0" state.
>
> Previously, if a node was taken offline it would have a "1" in the
> node_status file (e.g. "nodename 1"). If a node was online it didn't have
> an entry at all. I'm guessing the change was perhaps made to benefit
> large-scale clusters.
>
> If a node is simply down, there is no difference recorded in the
> node_status file... that is, at pbs_server startup it assumes the node
> should be online and marks it as free. I even deleted the file, which
> caused the nodes to come online as they were detected rather than being
> assumed to all be online.
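>
> For reference, here is a rough, untested Python sketch for dumping that
> file. It just assumes the format described above (one "nodename <state>"
> pair per line, "0" for online nodes, no entry for offline ones) and the
> default TORQUE_HOME of /var/spool/torque, so adjust the path to suit:
>
>     #!/usr/bin/env python
>     # Dump server_priv/node_status so the recorded state of each node can
>     # be compared with what pbsnodes reports. Assumes one
>     # "nodename <state>" pair per line; nodes without an entry (e.g.
>     # offline ones) simply will not appear here.
>
>     NODE_STATUS = "/var/spool/torque/server_priv/node_status"
>
>     def read_node_status(path=NODE_STATUS):
>         states = {}
>         with open(path) as f:
>             for line in f:
>                 parts = line.split()
>                 if len(parts) >= 2:
>                     states[parts[0]] = parts[1]
>         return states
>
>     if __name__ == "__main__":
>         for node, state in sorted(read_node_status().items()):
>             print("%-20s %s" % (node, state))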
>
>
> So, why do I bring all this up? I'm seeing a case here where MOAB tries
> to launch jobs on the nodes that are "down"; the jobs fail and get pushed
> to the deferred state (straight after a pbs_server start, which includes
> a boot of the management node). Those jobs could have started on a
> different online node (or stayed idle). It's not a big issue in itself,
> as the jobs should start fine again after the defer period. However, it
> does mean that a job at the top of the Idle queue (highest priority) can
> be deferred while lower priority jobs get to start. That will cause an
> issue with the users...
>
> Note: I get the following error on the job:
> RM failure, rc: 15046, msg: 'Resource temporarily unavailable'
> And then it is deferred (for an hour by default).
>
>
> If this is a bug, I'll submit a ticket.
> If the developers have changed something to improve things elsewhere
> (there has been a lot of work on the large-scale side of Torque), then
> perhaps that change has led to this issue.
> Is there something I can do to work around the issue in the meantime?
> It does take a while before a node is finally detected as being down.
>
> Note: it takes about 150 seconds for a node to go from "free" to "down"
> after starting pbs_server. I have node_check_rate = 150 in the pbs_server
> settings, so I expect that is the trigger period. The default now appears
> to be 600 seconds.
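>
> As a stopgap on our side, something like the following rough, untested
> Python sketch might help: after starting pbs_server, poll "pbsnodes -a"
> and hold off resuming scheduling until every node has actually reported
> in. As far as I can tell, a node whose pbs_mom has not reported has no
> "status =" attribute in the pbsnodes output, so that is what it keys off;
> the 150 second window just mirrors my node_check_rate setting:
>
>     #!/usr/bin/env python
>     # Poll "pbsnodes -a" until every node shows a "status =" attribute
>     # (i.e. its pbs_mom has reported in) or the deadline passes. In the
>     # pbsnodes -a output, node names start in column one and attribute
>     # lines are indented.
>
>     import subprocess
>     import time
>
>     def nodes_without_status():
>         out = subprocess.check_output(["pbsnodes", "-a"])
>         out = out.decode("utf-8", "replace")
>         missing, current, reported = [], None, False
>         for line in out.splitlines():
>             if line and not line[0].isspace():   # a node name line
>                 if current is not None and not reported:
>                     missing.append(current)
>                 current, reported = line.strip(), False
>             elif "status =" in line:
>                 reported = True
>         if current is not None and not reported:
>             missing.append(current)
>         return missing
>
>     if __name__ == "__main__":
>         deadline = time.time() + 150             # ~ one node_check_rate
>         missing = nodes_without_status()
>         while missing and time.time() < deadline:
>             time.sleep(10)
>             missing = nodes_without_status()
>         if missing:
>             print("nodes that never reported: " + ", ".join(missing))
>         else:
>             print("all nodes have reported in")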
>
>
> Cheers,
> Craig.
>
> --
> Craig West                   Systems Manager
> Victorian Partnership for Advanced Computing
>



-- 
David Beer | Software Engineer
Adaptive Computing