[torqueusers] queues are not working after master node reboot
sm4082 at nyu.edu
Wed Mar 16 10:58:20 MDT 2011
Sorry. I should have been clear about the problem. Queues are definitely visible. We have three queues, p12, p48, and bigmem, based on walltime and memory requirements. In the nodes file we have assigned attributes to all the nodes: parallel, quad-core, and bigmem. The nodes with np=8 have both the parallel and quad-core attributes. The big memory nodes have the bigmem attribute. The nodes with np=12 have just the parallel attribute.
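For illustration, the nodes file entries look roughly like this (hostnames and the exact number of nodes here are placeholders, not our actual values):

```
# $TORQUE_HOME/server_priv/nodes -- format: hostname np=<cores> <attributes...>
compute-01  np=8   parallel quad-core
compute-02  np=8   parallel quad-core
compute-10  np=12  parallel
bigmem-01   np=8   bigmem
```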
The p12 queue has a max walltime of 12 hours and the parallel attribute (resources_default.neednodes = parallel), whereas p48 has a max walltime of 48 hours and the quad-core attribute (resources_default.neednodes = quad-core). The bigmem queue has max and min memory set to specific values along with the node attribute bigmem (resources_default.neednodes = bigmem). The default queue on our cluster is route, with route destinations in the order p12, p48.
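In qmgr terms, the setup amounts to roughly the following (the exact memory limits on bigmem are omitted; the walltime values are the ones described above):

```
set queue p12 resources_max.walltime = 12:00:00
set queue p12 resources_default.neednodes = parallel
set queue p48 resources_max.walltime = 48:00:00
set queue p48 resources_default.neednodes = quad-core
set queue bigmem resources_default.neednodes = bigmem
set queue route queue_type = Route
set queue route route_destinations = p12
set queue route route_destinations += p48
```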
The problem is that even when jobs are submitted with a specific queue requirement they end up on the wrong nodes. For example, according to our queue settings, p48 jobs (#PBS -q p48), i.e., 48 hour jobs with a processor requirement of just ppn=8, should go onto the nodes with the quad-core attribute. As I wrote in my first post, it used to work exactly like that. But after the reboot all these 48 hour jobs in queue p48 end up on the 12 cpu nodes (only 12 hour jobs should run there, as the max walltime on the p12 queue is 12 hours).
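A typical job script for this case would look like the following (the script body is just a placeholder):

```
#!/bin/bash
#PBS -q p48
#PBS -l nodes=1:ppn=8
#PBS -l walltime=48:00:00
# Expected: this lands on a quad-core node. Since the reboot it
# ends up on the np=12 (parallel-only) nodes instead.
cd $PBS_O_WORKDIR
./my_app
```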
The same happens with big memory jobs, where the bigmem queue is explicitly mentioned along with memory requirements that match the nodes carrying the bigmem attribute. These jobs are going onto nodes with less memory (the p12 nodes have just 24GB), so they don't run well because there isn't enough memory for the work.
Even routing through the route queue is not working properly; jobs are being routed to all the queues. If no queue is mentioned, jobs should be routed based on the maximum walltime setting of each queue. I hope I am right here. But jobs with a 48 hour walltime are landing in p12, which has just a 12 hour maximum walltime.
So jobs are not going onto the right nodes, with or without mentioning the queue in the pbs scripts. All of it was working well until we rebooted the master node; after that they are just going everywhere.
pbsnodes -a gives the right output, with the node attributes and queues each node should have. Everything looks fine, yet jobs end up on the wrong nodes.
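For reference, these are the kinds of checks involved (pbsnodes -a is the one I ran; the qmgr and Moab commands would be the obvious next steps):

```
pbsnodes -a                 # node attributes and queue assignments look correct
qmgr -c 'print server'      # dump the full server/queue configuration
qmgr -c 'list queue p48'    # confirm resources_default.neednodes = quad-core survived the reboot
checkjob -v <jobid>         # (Moab) see why a given job was mapped to those nodes
```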
I hope I explained the problem well. I am sorry if I went overboard in explaining and ended up with a long email.
On Mar 16, 2011, at 12:22 PM, Jerry Smith wrote:
> Can you define "queue settings are not working"?
> Are jobs not starting? Are the queues no longer visible? Are they showing the wrong nodes?
> A little more detail and we can probably get you to resolution faster.
> Sreedhar Manchu wrote:
>> Hi Steve,
>> First, thank you for writing. We have just 6 queues. Could you please clarify on modifying include files? If I can resolve it without having to rebuild torque it would be great. If that is the only solution, then I guess I will have to.
>> Thanks once again. I look forward to your reply.
>> On Mar 16, 2011, at 12:12 PM, Steve Crusan wrote:
>>> On 3/16/11 9:35 AM, "Sreedhar Manchu" <sm4082 at nyu.edu> wrote:
>>>> Hello Everyone,
>>>> My name is Sreedhar. I am new to this mailing list. I have a quick question on
>>>> queues. I would really appreciate it if someone could help me with it. Very
>>>> recently, we rebooted the master node. Since then the queue settings are not
>>>> working on our cluster. It used to be fine until the reboot. We haven't
>>>> changed anything in settings. Moab is the scheduler. I have tried to restart
>>>> both pbs and moab and still jobs end up in the wrong queue.
>>> How many queues do you have? We had similar problems on our dev cluster once
>>> we had more than 16 queues, and ended up having to modify some include
>>> files + rebuild torque.
>>>> I have looked into the documentation but didn't find anything related to this type
>>>> of problem. I would really appreciate it if someone could help me.
>>>> torqueusers mailing list
>>>> torqueusers at supercluster.org
>>> Steve Crusan
>>> System Administrator
>>> Center for Research Computing
>>> University of Rochester