[torqueusers] queues are not working after master node reboot
Jerry Smith
jdsmit at sandia.gov
Wed Mar 16 11:36:03 MDT 2011
No problem with the long email, details help.
Do Moab and Torque agree on the node settings?
If you compare pbsnodes -a nodeX with checknode -v nodeX, do they agree
on the features/settings?
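
As a rough aid for that comparison, here is a minimal Python sketch that
pulls the np and properties fields out of saved pbsnodes -a output so they
can be checked by eye against checknode -v. The sample text and the node
names (node01, node02) are invented for illustration, and the parser
assumes the usual layout of an unindented node name followed by indented
"key = value" lines. Adjust it if your output looks different.

```python
def parse_pbsnodes(text):
    """Return {node_name: {key: value}} from `pbsnodes -a` style text."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None          # a blank line ends a node record
            continue
        if not line[0].isspace():
            current = line.strip()  # an unindented line is a node name
            nodes[current] = {}
        elif current is not None and " = " in line:
            key, value = line.strip().split(" = ", 1)
            nodes[current][key] = value
    return nodes


def node_features(nodes):
    """Map each node name to (np, set of properties)."""
    result = {}
    for name, attrs in nodes.items():
        props = attrs.get("properties", "")
        features = {p for p in props.split(",") if p}
        result[name] = (int(attrs.get("np", "0")), features)
    return result


# Hypothetical sample, NOT taken from the cluster discussed in this thread.
SAMPLE = """\
node01
     state = free
     np = 8
     properties = parallel,quad-core
     ntype = cluster

node02
     state = free
     np = 12
     properties = parallel
     ntype = cluster
"""

if __name__ == "__main__":
    summary = node_features(parse_pbsnodes(SAMPLE))
    for name, (np_count, features) in sorted(summary.items()):
        print(name, np_count, ",".join(sorted(features)))
```

To use it on a real system, replace SAMPLE with the saved output of
pbsnodes -a and compare each node's feature list with what Moab reports.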
Do qmgr -c "p s" | grep queue and mdiag -c -v agree on the min/max
settings?
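
To make the Torque side of that check easier to read, a small Python
sketch like the one below can collect the "set queue ..." lines from saved
qmgr -c "p s" output into one dictionary per queue. The sample lines are
reconstructed from the queue description in this thread (they are an
assumption, not real output from the cluster), and the regular expression
assumes the usual "set queue NAME ATTR = VALUE" form, with "+=" used for
additional values of a multi-valued attribute.

```python
import re

# Matches lines such as:
#   set queue p12 resources_max.walltime = 12:00:00
#   set queue route route_destinations += p48
SET_QUEUE = re.compile(r"^set queue (\S+) (\S+) (\+?=) (.*)$")


def parse_queue_settings(text):
    """Return {queue: {attribute: [values...]}} from `qmgr -c "p s"` text."""
    queues = {}
    for line in text.splitlines():
        match = SET_QUEUE.match(line.strip())
        if not match:
            continue  # skip comments, "create queue" and "set server" lines
        queue, attr, _op, value = match.groups()
        # Values are kept as lists so "+=" lines accumulate in order.
        queues.setdefault(queue, {}).setdefault(attr, []).append(value)
    return queues


# Hypothetical sample based on the settings described in this thread.
SAMPLE = """\
create queue p12
set queue p12 queue_type = Execution
set queue p12 resources_max.walltime = 12:00:00
set queue p12 resources_default.neednodes = parallel
create queue p48
set queue p48 queue_type = Execution
set queue p48 resources_max.walltime = 48:00:00
set queue p48 resources_default.neednodes = quad-core
create queue route
set queue route queue_type = Route
set queue route route_destinations = p12
set queue route route_destinations += p48
"""

if __name__ == "__main__":
    wanted = ("resources_max.walltime",
              "resources_default.neednodes",
              "route_destinations")
    for queue, attrs in sorted(parse_queue_settings(SAMPLE).items()):
        for attr in wanted:
            if attr in attrs:
                print(queue, attr, ",".join(attrs[attr]))
```

The per-queue values printed here are the ones to line up against what
mdiag -c -v shows for the matching Moab class.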
What versions of Torque and Moab are you using?
Jerry
Sreedhar Manchu wrote:
> Hi Jerry,
>
> Sorry. I should have been clear about the problem. Queues are
> definitely visible. We have the queues p12, p48 and bigmem, based on
> walltime and memory requirements. In the nodes file we have assigned
> three different attributes to the nodes: parallel, quad-core and
> bigmem. The nodes with np=8 have both the parallel and quad-core
> attributes. The big memory nodes have the bigmem attribute. The nodes
> with np=12 have just the parallel attribute.
>
> The p12 queue has a max walltime of 12 hours and the parallel attribute
> (resources_default.neednodes = parallel), whereas p48 has a max
> walltime of 48 hours and the quad-core attribute
> (resources_default.neednodes = quad-core). The bigmem queue has max
> memory and min memory set to specific values, with the node attribute
> bigmem (resources_default.neednodes = bigmem). The default queue on our
> cluster is route, with route destinations in the order p12, p48.
>
> The problem is that even when jobs are submitted with a specific queue
> requirement, they end up on the wrong nodes. For example, according to
> our queue settings, p48 (#PBS -q p48) jobs, i.e., 48 hour jobs with a
> processor requirement of just ppn=8, should go onto the nodes with the
> quad-core attribute. As I wrote in my first post, this used to work
> as it should. But after the reboot all these 48 hour jobs with queue
> p48 end up on the 12 cpu nodes (only 12 hour jobs should run on the
> p12 queue, as the max walltime on this queue is 12 hours).
>
> Big memory jobs explicitly request the bigmem queue, with specific
> memory requirements that suit nodes with the bigmem attribute. But
> these jobs are going onto nodes with less memory (p12 nodes have just
> 24GB of memory), so they don't run well, as the memory is not enough
> for the operation.
>
> Even routing is not working properly through the route queue. The jobs
> are being routed to all the queues. If no queue is mentioned, the jobs
> should be routed according to the maximum walltime setting of each
> queue. I hope I am right here. But the jobs with 48 hour walltime are
> going onto p12, which has just a 12 hour maximum walltime.
>
> So jobs are not going onto the right nodes, with or without the queue
> being mentioned in the pbs scripts. All of it was working well until we
> rebooted the master node. After that they are just going everywhere.
>
> I see pbsnodes -a giving the right output, with the node attributes and
> queues each node should have. Everything looks fine, but jobs end up on
> the wrong nodes.
>
> I hope I explained the problem well. I am sorry if I went overboard in
> explaining and ended up with a long email.
>
> Thanks,
> Sreedhar.
>
> On Mar 16, 2011, at 12:22 PM, Jerry Smith wrote:
>
>> Sreedhar,
>>
>> Can you define "queue settings are not working"?
>>
>> Are jobs not starting? Are the queues no longer visible? Are they
>> showing the wrong nodes?
>>
>> A little more detail and we can probably get you to resolution faster.
>>
>> Jerry
>>
>> Sreedhar Manchu wrote:
>>> Hi Steve,
>>>
>>> First, thank you for writing. We have just 6 queues. Could you please clarify what modifying the include files involves? If I can resolve it without having to rebuild torque, that would be great. If that is the only solution, then I guess I will have to.
>>>
>>> Thanks once again. I look forward to your reply.
>>>
>>> Regards,
>>> Sreedhar.
>>>
>>> On Mar 16, 2011, at 12:12 PM, Steve Crusan wrote:
>>>
>>>
>>>> On 3/16/11 9:35 AM, "Sreedhar Manchu" <sm4082 at nyu.edu> wrote:
>>>>
>>>>
>>>>> Hello Everyone,
>>>>>
>>>>> My name is Sreedhar. I am new to this mailing list. I have a quick question on
>>>>> queues. I would really appreciate it if someone could help me with it. Very
>>>>> recently, we rebooted the master node. Since then the queue settings are not
>>>>> working on our cluster. It used to be fine until the reboot. We haven't
>>>>> changed anything in settings. Moab is the scheduler. I have tried to restart
>>>>> both pbs and moab and still jobs end up in the wrong queue.
>>>>>
>>>> How many queues do you have? We had similar problems on our dev cluster, one
>>>> where we had more than 16 queues, and ended up having to modify some include
>>>> files + rebuild torque.
>>>>
>>>>
>>>>
>>>>> I have looked into documentation but didn't find anything related to this type
>>>>> of problem. I would really appreciate it if someone could help me.
>>>>>
>>>>> Thanks,
>>>>> Sreedhar.
>>>>> _______________________________________________
>>>>> torqueusers mailing list
>>>>> torqueusers at supercluster.org
>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>
>>>> ----------------------
>>>> Steve Crusan
>>>> System Administrator
>>>> Center for Research Computing
>>>> University of Rochester
>>>> https://www.crc.rochester.edu/
>>>>
>>>
>