[torqueusers] queues are not working after master node reboot

Jerry Smith jdsmit at sandia.gov
Wed Mar 16 11:36:03 MDT 2011


No problem with the long email, details help.

Do Moab and Torque agree on the node settings?

Comparing the output of pbsnodes -a nodeX and checknode -v nodeX, do they 
agree on the features/settings?

Do the outputs of qmgr -c "p s" | grep queue and mdiag -c -v agree on the 
min/max settings?
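
For example, something along these lines (nodeX and the queue names are 
just placeholders for your own node and queue names):

    pbsnodes -a nodeX | grep properties
    checknode -v nodeX | grep -i features

    qmgr -c "p s" | grep -E "p12|p48|bigmem"
    mdiag -c -v

The node properties Torque reports should match the features Moab sees, 
and the per-queue walltime/memory limits from qmgr should match what 
mdiag -c shows for each class.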

What versions of Torque and Moab are you using?

Jerry

Sreedhar Manchu wrote:
> Hi Jerry,
>
> Sorry, I should have been clearer about the problem. The queues are 
> definitely visible. We have three different queues, p12, p48 and bigmem, 
> based on walltime and memory requirements. In the nodes file we have 
> assigned three different attributes to the nodes: parallel, quad-core 
> and bigmem. The nodes with np=8 have both the parallel and quad-core 
> attributes, the big memory nodes have the bigmem attribute, and the 
> nodes with np=12 have just the parallel attribute.
>
> The p12 queue has a max walltime of 12 hours and the parallel attribute 
> (resources_default.neednodes = parallel), whereas p48 has a max 
> walltime of 48 hours and the quad-core attribute 
> (resources_default.neednodes = quad-core). The bigmem queue has max 
> memory and min memory set to specific values, with the node attribute 
> bigmem (resources_default.neednodes = bigmem). The default queue on 
> our cluster is route, with route destinations in the order p12, p48.
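>
> In qmgr terms, the relevant queue settings look roughly like this (a 
> sketch of what I described above; the memory values are placeholders, 
> not our exact numbers):
>
>     set queue p12 resources_max.walltime = 12:00:00
>     set queue p12 resources_default.neednodes = parallel
>     set queue p48 resources_max.walltime = 48:00:00
>     set queue p48 resources_default.neednodes = quad-core
>     # memory limits below are placeholders for our site-specific values
>     set queue bigmem resources_min.mem = 32gb
>     set queue bigmem resources_max.mem = 96gb
>     set queue bigmem resources_default.neednodes = bigmem
>     set queue route queue_type = Route
>     set queue route route_destinations = p12
>     set queue route route_destinations += p48
>     set server default_queue = route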
>
> The problem is that even when jobs are submitted with a specific queue 
> requested, they end up on the wrong nodes. For example, according to 
> our queue settings, p48 jobs (#PBS -q p48), i.e., 48-hour jobs with a 
> processor requirement of just ppn=8, should go onto nodes with the 
> quad-core attribute. As I wrote in my first post, it used to work the 
> way it should. But after the reboot, all these 48-hour jobs submitted 
> to queue p48 end up on the 12-cpu nodes (only 12-hour jobs should run 
> there, as the max walltime on the p12 queue is 12 hours).
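>
> For reference, a typical p48 job of ours starts roughly like this (the 
> exact resource values are just representative):
>
>     #PBS -q p48
>     #PBS -l nodes=1:ppn=8
>     #PBS -l walltime=48:00:00
>
> With the neednodes default above, a job like this should only land on 
> the quad-core nodes, but since the reboot it doesn't.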
>
> The same happens with big memory jobs, where the bigmem queue is 
> explicitly requested along with specific memory requirements that suit 
> the nodes with the bigmem attribute. These jobs are going onto nodes 
> with less memory (p12 nodes have just 24GB of memory), so they don't 
> run well because there isn't enough memory for the job.
>
> Even routing through the route queue is not working properly. Jobs are 
> being routed to all the queues. If no queue is specified, jobs should 
> be routed based on the maximum walltime setting of each queue (I hope 
> I am right here), but jobs with a 48-hour walltime are going onto p12, 
> which has just a 12-hour maximum walltime.
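>
> For instance, submitting without -q like this (job.sh is just a 
> placeholder script name):
>
>     qsub -l walltime=10:00:00,nodes=1:ppn=8 job.sh    # should route to p12
>     qsub -l walltime=48:00:00,nodes=1:ppn=8 job.sh    # should route to p48
>
> should send the 48-hour job to p48, but right now it lands in p12 
> instead.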
>
> So jobs are not going onto the right nodes, whether or not a queue is 
> specified in the PBS scripts. All of this was working well until we 
> rebooted the master node; since then they are just going everywhere.
>
> I see pbsnodes -a giving the right output, with the node attributes and 
> queues it should have. Everything looks fine, but jobs end up on the 
> wrong nodes.
>
> I hope I have explained the problem well. I am sorry if I went 
> overboard in explaining and ended up with a long email.
>
> Thanks,
> Sreedhar.
>
> On Mar 16, 2011, at 12:22 PM, Jerry Smith wrote:
>
>> Sreedhar,
>>
>> Can you define "queue settings are not working"?
>>
>> Are jobs not starting?  Are the queues no longer visible? Are they 
>> showing the wrong nodes?
>>
>> A little more detail and we can probably get you to resolution faster.
>>
>> Jerry
>>
>> Sreedhar Manchu wrote:
>>> Hi Steve,
>>>
>>> First, thank you for writing. We have just 6 queues. Could you please clarify what you mean by modifying include files? If I can resolve this without having to rebuild Torque, that would be great. If that is the only solution, then I guess I will have to.
>>>
>>> Thanks once again. I look forward to your reply.
>>>
>>> Regards,
>>> Sreedhar.
>>>
>>> On Mar 16, 2011, at 12:12 PM, Steve Crusan wrote:
>>>
>>>   
>>>> On 3/16/11 9:35 AM, "Sreedhar Manchu" <sm4082 at nyu.edu> wrote:
>>>>
>>>>     
>>>>> Hello Everyone,
>>>>>
>>>>> My name is Sreedhar. I am new to this mailing list. I have a quick question on
>>>>> queues. I would really appreciate it if someone could help me with it. Very
>>>>> recently, we rebooted the master node. Since then the queue settings have not
>>>>> been working on our cluster. Everything was fine until the reboot, and we
>>>>> haven't changed any settings. Moab is the scheduler. I have tried restarting
>>>>> both pbs and moab, and jobs still end up in the wrong queue.
>>>>>       
>>>> How many queues do you have? We had similar problems on our dev cluster once
>>>> we had more than 16 queues, and ended up having to modify some include
>>>> files and rebuild Torque.
>>>>
>>>>
>>>>     
>>>>> I have looked into the documentation but didn't find anything related to this type
>>>>> of problem. I would really appreciate it if someone could help me.
>>>>>
>>>>> Thanks,
>>>>> Sreedhar.
>>>>>       
>>>> ----------------------
>>>> Steve Crusan
>>>> System Administrator
>>>> Center for Research Computing
>>>> University of Rochester
>>>> https://www.crc.rochester.edu/
>>>>
>>>>     
>>>
>