[torqueusers] Fwd: Multi-req job not starting

Kunal Rao kunalgrao at gmail.com
Fri Jun 1 14:34:09 MDT 2012


I found this post online:
http://www.supercluster.org/pipermail/mauiusers/2010-February/004116.html

I also have JOBNODEMATCHPOLICY EXACTNODE and NODEACCESSPOLICY SINGLEJOB set
in my configuration. Could this bug still be present in Maui?

I tested with a smaller cluster; let me explain the scenario again:

This time I have a 6-node cluster running Torque 3.0.3 and Maui.
The additional configuration in my Maui configuration file is:

----------
BACKFILLPOLICY        FIRSTFIT
RESERVATIONPOLICY     CURRENTHIGHEST

ENABLEMULTIREQJOBS TRUE
JOBNODEMATCHPOLICY EXACTNODE
NODEACCESSPOLICY SINGLEJOB

----------
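
(For reference, after restarting Maui I double-check that these settings are
actually active with showconfig; a minimal check, roughly:)

----------
# verify Maui picked up the multi-req settings (output format may vary)
showconfig | grep -E 'ENABLEMULTIREQJOBS|JOBNODEMATCHPOLICY|NODEACCESSPOLICY'
----------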

Now I submit a 2-node job with the following resource requirement:

----------
#PBS -l nodes=2,walltime=0:10:00
---------
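
(The actual submission script is trivial; roughly something like the
following, with a placeholder workload, since the allocation behaviour does
not depend on what the job runs:)

----------
#!/bin/bash
#PBS -l nodes=2,walltime=0:10:00
# print the nodes/procs Torque assigned, then idle for the walltime
cat $PBS_NODEFILE
sleep 300
----------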

This job starts on node1/0 + node2/0

Next, I submit a 4-node job with the following resource requirement:

---------
#PBS -l nodes=1:ppn=2+3,walltime=0:05:00
--------

This job also starts, but with the following resources: node3/0 + node3/1 +
node4/0 + node4/1 + node5/0

I would expect this job to use the resources as follows: node3/0 + node3/1
+ node4/0 + node5/0 + node6/0. Instead, it did not use node6 at all: it
placed 2 procs each on node3 and node4 and 1 proc on node5, while node6
remained idle.
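
To make the difference concrete, this is roughly what I would expect
$PBS_NODEFILE to contain for this job (one line per allocated proc) versus
what it actually contained:

----------
expected          observed
--------          --------
node3             node3
node3             node3
node4             node4
node5             node4
node6             node5
----------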

Is this a bug, or is some other configuration/setting required?

Thanks,
Kunal

On Fri, Jun 1, 2012 at 3:57 PM, Kunal Rao <kunalgrao at gmail.com> wrote:

> I removed NODEALLOCATIONPOLICY and tried again. This time the job started,
> but the node allocation was not as expected.
>
> The job needs 1 node with 2 procs and 3 nodes with 1 proc each. The
> allocation was made on only 3 nodes: 2 with 2 procs and 1 with 1 proc. I am
> not sure whether this is a bug or a conflict in the configuration.
>
> My current additional configuration is:
>
>
> BACKFILLPOLICY        FIRSTFIT
> RESERVATIONPOLICY     CURRENTHIGHEST
>
> ENABLEMULTIREQJOBS TRUE
> JOBNODEMATCHPOLICY EXACTNODE
> NODEACCESSPOLICY SINGLEJOB
>
> I also tried with the following, but the result was the same:
>
>
> BACKFILLPOLICY        FIRSTFIT
> RESERVATIONPOLICY     CURRENTHIGHEST
>
> ENABLEMULTIREQJOBS TRUE
> NODEALLOCATIONPOLICY PRIORITY
> NODECFG[DEFAULT] PRIORITYF='APROCS'
> JOBNODEMATCHPOLICY EXACTNODE
> NODEACCESSPOLICY SINGLEJOB
>
> Any suggestions?
>
> Thanks,
> Kunal
>
>
>
> On Thu, May 31, 2012 at 10:26 PM, Kunal Rao <kunalgrao at gmail.com> wrote:
>
>> I need NODEACCESSPOLICY; maybe I'll remove NODEALLOCATIONPOLICY and check
>> tomorrow.
>>
>> Thanks,
>> Kunal
>>
>>
>> On Thu, May 31, 2012 at 10:23 PM, Ju JiaJia <jujj603 at gmail.com> wrote:
>>
>>> Everything seems OK. I think you could try deleting the additional
>>> configuration in maui.cfg, like NODEALLOCATIONPOLICY or NODEACCESSPOLICY,
>>> or use the defaults or other options.
>>>
>>>
>>> On Fri, Jun 1, 2012 at 9:59 AM, Kunal Rao <kunalgrao at gmail.com> wrote:
>>>
>>>> Each node has 16 cores. The TORQUE_HOME/server_priv/nodes file has, for
>>>> each of the 10 nodes:
>>>>
>>>> <node_name> np=16 gpus=1
>>>>
>>>> Thanks,
>>>> Kunal
>>>>
>>>>
>>>>  On Thu, May 31, 2012 at 9:54 PM, Ju JiaJia <jujj603 at gmail.com> wrote:
>>>>
>>>>> How many cores are on each of the 10 nodes? I ask because you are
>>>>> trying to allocate 2 processors on one node. And how did you
>>>>> configure TORQUE_HOME/server_priv/nodes ?
>>>>>
>>>>>
>>>>> On Fri, Jun 1, 2012 at 8:54 AM, Kunal Rao <kunalgrao at gmail.com> wrote:
>>>>>
>>>>>> Queue / Server configuration :
>>>>>>
>>>>>> ---------------
>>>>>>
>>>>>> qmgr -c 'p s'
>>>>>> #
>>>>>> # Create queues and set their attributes.
>>>>>> #
>>>>>> #
>>>>>> # Create and define queue batch
>>>>>> #
>>>>>> create queue batch
>>>>>> set queue batch queue_type = Execution
>>>>>> set queue batch resources_default.nodes = 1
>>>>>> set queue batch resources_default.walltime = 01:00:00
>>>>>> set queue batch enabled = True
>>>>>> set queue batch started = True
>>>>>> #
>>>>>> # Set server attributes.
>>>>>> #
>>>>>> set server scheduling = True
>>>>>> set server acl_hosts = fire16
>>>>>> set server acl_roots = root at fire16.csa.local
>>>>>> set server managers = root at fire16.csa.local
>>>>>> set server operators = root at fire16.csa.local
>>>>>> set server default_queue = batch
>>>>>> set server log_events = 511
>>>>>> set server mail_from = adm
>>>>>> set server scheduler_iteration = 20
>>>>>> set server node_check_rate = 150
>>>>>> set server tcp_timeout = 6
>>>>>> set server mom_job_sync = True
>>>>>> set server keep_completed = 300
>>>>>> set server allow_node_submit = True
>>>>>> set server next_job_number = 6331
>>>>>>
>>>>>> ---------------
>>>>>>
>>>>>> Job resource requirement :
>>>>>>
>>>>>> ---------
>>>>>>
>>>>>> #PBS -l nodes=1:ppn=2+3,walltime=0:05:00
>>>>>>
>>>>>> ---------
>>>>>>
>>>>>> "pbsnodes -a" shows all the 10 nodes in "free" state. So, they are all
>>>>>> accessible.
>>>>>>
>>>>>> Thanks,
>>>>>> Kunal
>>>>>>
>>>>>>
>>>>>> On 5/31/12, Ju JiaJia <jujj603 at gmail.com> wrote:
>>>>>> > Please give your queue/server configuration and your job's resource
>>>>>> > needs (CPU/memory, etc.). Also, are all 10 nodes accessible? You can
>>>>>> > use pbsnodes to check this.
>>>>>> >
>>>>>> > On Thu, May 31, 2012 at 10:53 PM, Kunal Rao <kunalgrao at gmail.com>
>>>>>> wrote:
>>>>>> >
>>>>>> >> Hello,
>>>>>> >>
>>>>>> >> Please see the message below. I had posted it on the Maui users
>>>>>> >> mailing list but did not get any response, so I thought of posting
>>>>>> >> it here on the torque users mailing list, in case someone would
>>>>>> >> know. Kindly let me know if you have any comments/ideas/suggestions.
>>>>>> >>
>>>>>> >> Thanks,
>>>>>> >> Kunal
>>>>>> >>
>>>>>> >> ---------- Forwarded message ----------
>>>>>> >> From: Kunal Rao <kunalgrao at gmail.com>
>>>>>> >> Date: Wed, May 23, 2012 at 2:30 PM
>>>>>> >> Subject: Re: Multi-req job not starting
>>>>>> >> To: mauiusers at supercluster.org
>>>>>> >>
>>>>>> >>
>>>>>> >> There was a similar post earlier:
>>>>>> >> http://www.clusterresources.com/pipermail/mauiusers/2009-July/003930.html
>>>>>> >>
>>>>>> >> But I did not find any response to it. Can anyone please provide
>>>>>> >> some ideas/suggestions on this issue?
>>>>>> >>
>>>>>> >> Thanks,
>>>>>> >> Kunal
>>>>>> >>
>>>>>> >>
>>>>>> >> On Wed, May 23, 2012 at 2:26 PM, Kunal Rao <kunalgrao at gmail.com>
>>>>>> wrote:
>>>>>> >>
>>>>>> >>> Hello,
>>>>>> >>>
>>>>>> >>> I have a 10-node cluster and 3 jobs: one which needs 2 nodes (with
>>>>>> >>> 1 task per node), another which needs 4 nodes (with 1 task per
>>>>>> >>> node), and a third which needs 4 nodes (with 2 tasks on 1 node and
>>>>>> >>> 1 task each on the other 3 nodes).
>>>>>> >>>
>>>>>> >>> Additional configuration in maui.cfg is :
>>>>>> >>>
>>>>>> >>> BACKFILLPOLICY        FIRSTFIT
>>>>>> >>> RESERVATIONPOLICY     CURRENTHIGHEST
>>>>>> >>>
>>>>>> >>> ENABLEMULTIREQJOBS TRUE
>>>>>> >>> NODEALLOCATIONPOLICY  MINRESOURCE
>>>>>> >>> NODEACCESSPOLICY SINGLEJOB
>>>>>> >>> JOBNODEMATCHPOLICY EXACTNODE
>>>>>> >>>
>>>>>> >>> I am observing that if the first 2 jobs are running, the third one
>>>>>> >>> does not start (even though 4 nodes are available) until one of the
>>>>>> >>> jobs completes. checkjob -v <job_id> shows the following output:
>>>>>> >>>
>>>>>> >>> ------------------
>>>>>> >>>
>>>>>> >>> checking job 5791 (RM job '5791.fire16.csa.local')
>>>>>> >>>
>>>>>> >>> State: Idle
>>>>>> >>> Creds:  user:kunal  group:kunal  class:batch  qos:DEFAULT
>>>>>> >>> WallTime: 00:00:00 of 00:04:51
>>>>>> >>> SubmitTime: Wed May 23 11:52:04
>>>>>> >>>   (Time Queued  Total: 00:48:52  Eligible: 00:48:52)
>>>>>> >>>
>>>>>> >>> StartDate: 00:00:01  Wed May 23 12:40:57
>>>>>> >>> Total Tasks: 2
>>>>>> >>>
>>>>>> >>> Req[0]  TaskCount: 2  Partition: ALL
>>>>>> >>> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
>>>>>> >>> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
>>>>>> >>> Exec:  ''  ExecSize: 0  ImageSize: 0
>>>>>> >>> Dedicated Resources Per Task: PROCS: 1
>>>>>> >>> NodeAccess: SINGLEJOB
>>>>>> >>> TasksPerNode: 2  NodeCount: 1
>>>>>> >>>
>>>>>> >>> Req[1]  TaskCount: 3  Partition: ALL
>>>>>> >>> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
>>>>>> >>> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
>>>>>> >>> Exec:  ''  ExecSize: 0  ImageSize: 0
>>>>>> >>> Dedicated Resources Per Task: PROCS: 1
>>>>>> >>> NodeAccess: SINGLEJOB
>>>>>> >>> NodeCount: 3
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> IWD: [NONE]  Executable:  [NONE]
>>>>>> >>> Bypass: 5  StartCount: 0
>>>>>> >>> PartitionMask: [ALL]
>>>>>> >>> Flags:       RESTARTABLE
>>>>>> >>>
>>>>>> >>> Reservation '5791' (00:00:01 -> 00:04:52  Duration: 00:04:51)
>>>>>> >>> PE:  5.00  StartPriority:  48
>>>>>> >>> cannot select job 5791 for partition DEFAULT (startdate in
>>>>>> >>> '00:00:01')
>>>>>> >>>
>>>>>> >>> ------------
>>>>>> >>>
>>>>>> >>> What could be the reason for this job not starting? How do I
>>>>>> >>> resolve this?
>>>>>> >>>
>>>>>> >>> Thanks,
>>>>>> >>> Kunal
>>>>>> >>>
>>>>>> >>
>>>>>> >>
>>>>>> >>