[torqueusers] Fwd: Multi-req job not starting

Kunal Rao kunalgrao at gmail.com
Fri Jun 1 13:57:22 MDT 2012


I removed NODEALLOCATIONPOLICY and tried again. This time the job started,
but the node allocation was not as expected.

The job needs 1 node with 2 procs and 3 nodes with 1 proc each, but the
allocation used only 3 nodes: 2 with 2 procs and 1 with 1 proc. I am not
sure whether this is a bug or a conflict in the configuration.
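
For reference, this is roughly the kind of test script I can use to see what
was actually handed out (the -l line is the real request; the job name and
everything else is just illustrative):

#!/bin/bash
#PBS -l nodes=1:ppn=2+3,walltime=0:05:00
#PBS -N multireq-test

# TORQUE writes one line per allocated processor into $PBS_NODEFILE,
# so the expected output here is one host listed twice and three hosts
# listed once each.
sort $PBS_NODEFILE | uniq -c

With the allocation described above, this shows two hosts with a count of 2
and one host with a count of 1, instead of the expected 2/1/1/1 layout.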

My current additional maui.cfg settings are:

BACKFILLPOLICY        FIRSTFIT
RESERVATIONPOLICY     CURRENTHIGHEST

ENABLEMULTIREQJOBS TRUE
JOBNODEMATCHPOLICY EXACTNODE
NODEACCESSPOLICY SINGLEJOB

I also tried the following, but got the same result:

BACKFILLPOLICY        FIRSTFIT
RESERVATIONPOLICY     CURRENTHIGHEST

ENABLEMULTIREQJOBS TRUE
NODEALLOCATIONPOLICY PRIORITY
NODECFG[DEFAULT] PRIORITYF='APROCS'
JOBNODEMATCHPOLICY EXACTNODE
NODEACCESSPOLICY SINGLEJOB
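
(In case it is relevant: as far as I know Maui only rereads maui.cfg on
restart, so after each change the effective policies can be verified with
something like

showconfig | egrep 'MULTIREQ|NODEMATCH|NODEACCESS|NODEALLOCATION'

where showconfig is the stock Maui command and the egrep pattern is just a
convenience.)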

Any suggestions?

Thanks,
Kunal


On Thu, May 31, 2012 at 10:26 PM, Kunal Rao <kunalgrao at gmail.com> wrote:

> I need NODEACCESSPOLICY; maybe I'll remove NODEALLOCATIONPOLICY and check
> tomorrow.
>
> Thanks,
> Kunal
>
>
> On Thu, May 31, 2012 at 10:23 PM, Ju JiaJia <jujj603 at gmail.com> wrote:
>
>> Everything seems to be OK. I think you could try deleting some of the additional
>> configuration in maui.cfg, like NODEALLOCATIONPOLICY or NODEACCESSPOLICY,
>> or use the defaults or other options.
>>
>>
>> On Fri, Jun 1, 2012 at 9:59 AM, Kunal Rao <kunalgrao at gmail.com> wrote:
>>
>>> Each node has 16 cores. The TORQUE_HOME/server_priv/nodes file has the
>>> following entry for each of the 10 nodes:
>>>
>>> <node_name> np=16 gpus=1
>>>
>>> Thanks,
>>> Kunal
>>>
>>>
>>>  On Thu, May 31, 2012 at 9:54 PM, Ju JiaJia <jujj603 at gmail.com> wrote:
>>>
>>>> How many cores does each of the 10 nodes have? I ask because you are trying to
>>>> allocate 2 processors on one node. And how did you
>>>> configure TORQUE_HOME/server_priv/nodes?
>>>>
>>>>
>>>> On Fri, Jun 1, 2012 at 8:54 AM, Kunal Rao <kunalgrao at gmail.com> wrote:
>>>>
>>>>> Queue / Server configuration :
>>>>>
>>>>> ---------------
>>>>>
>>>>> qmgr -c 'p s'
>>>>> #
>>>>> # Create queues and set their attributes.
>>>>> #
>>>>> #
>>>>> # Create and define queue batch
>>>>> #
>>>>> create queue batch
>>>>> set queue batch queue_type = Execution
>>>>> set queue batch resources_default.nodes = 1
>>>>> set queue batch resources_default.walltime = 01:00:00
>>>>> set queue batch enabled = True
>>>>> set queue batch started = True
>>>>> #
>>>>> # Set server attributes.
>>>>> #
>>>>> set server scheduling = True
>>>>> set server acl_hosts = fire16
>>>>> set server acl_roots = root at fire16.csa.local
>>>>> set server managers = root at fire16.csa.local
>>>>> set server operators = root at fire16.csa.local
>>>>> set server default_queue = batch
>>>>> set server log_events = 511
>>>>> set server mail_from = adm
>>>>> set server scheduler_iteration = 20
>>>>> set server node_check_rate = 150
>>>>> set server tcp_timeout = 6
>>>>> set server mom_job_sync = True
>>>>> set server keep_completed = 300
>>>>> set server allow_node_submit = True
>>>>> set server next_job_number = 6331
>>>>>
>>>>> ---------------
>>>>>
>>>>> Job resource requirement :
>>>>>
>>>>> ---------
>>>>>
>>>>> #PBS -l nodes=1:ppn=2+3,walltime=0:05:00
>>>>>
>>>>> ---------
>>>>>
>>>>> "pbsnodes -a" shows all the 10 nodes in "free" state. So, they are all
>>>>> accessible.
>>>>>
>>>>> Thanks,
>>>>> Kunal
>>>>>
>>>>>
>>>>> On 5/31/12, Ju JiaJia <jujj603 at gmail.com> wrote:
>>>>> > Please give your queue/server configuration and your job's resource
>>>>> > requirements (cpu/memory etc.). Also, are all 10 nodes accessible? You can
>>>>> > use pbsnodes to check this.
>>>>> >
>>>>> > On Thu, May 31, 2012 at 10:53 PM, Kunal Rao <kunalgrao at gmail.com> wrote:
>>>>> >
>>>>> >> Hello,
>>>>> >>
>>>>> >> Please see the message below. I had posted it on the maui users mailing list,
>>>>> >> but did not get any response, so I thought of posting it here on the torque
>>>>> >> users mailing list (in case someone would know). Kindly let me know if you
>>>>> >> have any comments / ideas / suggestions.
>>>>> >>
>>>>> >> Thanks,
>>>>> >> Kunal
>>>>> >>
>>>>> >> ---------- Forwarded message ----------
>>>>> >> From: Kunal Rao <kunalgrao at gmail.com>
>>>>> >> Date: Wed, May 23, 2012 at 2:30 PM
>>>>> >> Subject: Re: Multi-req job not starting
>>>>> >> To: mauiusers at supercluster.org
>>>>> >>
>>>>> >>
>>>>> >> There was a similar post earlier:
>>>>> >>
>>>>> >> http://www.clusterresources.com/pipermail/mauiusers/2009-July/003930.html
>>>>> >>
>>>>> >> But I did not find any response to it. Can anyone please provide some ideas
>>>>> >> / suggestions on this issue?
>>>>> >>
>>>>> >> Thanks,
>>>>> >> Kunal
>>>>> >>
>>>>> >>
>>>>> >> On Wed, May 23, 2012 at 2:26 PM, Kunal Rao <kunalgrao at gmail.com> wrote:
>>>>> >>
>>>>> >>> Hello,
>>>>> >>>
>>>>> >>> I have a 10-node cluster and 3 jobs: one which needs 2 nodes (with
>>>>> >>> 1 task per node), another which needs 4 nodes (with 1 task per node),
>>>>> >>> and a third which needs 4 nodes (with 2 tasks on 1 node and 1 task
>>>>> >>> each on the other 3 nodes).
>>>>> >>>
>>>>> >>> Additional configuration in maui.cfg is :
>>>>> >>>
>>>>> >>> BACKFILLPOLICY        FIRSTFIT
>>>>> >>> RESERVATIONPOLICY     CURRENTHIGHEST
>>>>> >>>
>>>>> >>> ENABLEMULTIREQJOBS TRUE
>>>>> >>> NODEALLOCATIONPOLICY  MINRESOURCE
>>>>> >>> NODEACCESSPOLICY SINGLEJOB
>>>>> >>> JOBNODEMATCHPOLICY EXACTNODE
>>>>> >>>
>>>>> >>> I am observing that if the first 2 jobs are running, the third one does
>>>>> >>> not start (even though 4 nodes are available) until one of the jobs
>>>>> >>> completes. checkjob -v <job_id> shows the following output:
>>>>> >>>
>>>>> >>> ------------------
>>>>> >>>
>>>>> >>> checking job 5791 (RM job '5791.fire16.csa.local')
>>>>> >>>
>>>>> >>> State: Idle
>>>>> >>> Creds:  user:kunal  group:kunal  class:batch  qos:DEFAULT
>>>>> >>> WallTime: 00:00:00 of 00:04:51
>>>>> >>> SubmitTime: Wed May 23 11:52:04
>>>>> >>>   (Time Queued  Total: 00:48:52  Eligible: 00:48:52)
>>>>> >>>
>>>>> >>> StartDate: 00:00:01  Wed May 23 12:40:57
>>>>> >>> Total Tasks: 2
>>>>> >>>
>>>>> >>> Req[0]  TaskCount: 2  Partition: ALL
>>>>> >>> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
>>>>> >>> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
>>>>> >>> Exec:  ''  ExecSize: 0  ImageSize: 0
>>>>> >>> Dedicated Resources Per Task: PROCS: 1
>>>>> >>> NodeAccess: SINGLEJOB
>>>>> >>> TasksPerNode: 2  NodeCount: 1
>>>>> >>>
>>>>> >>> Req[1]  TaskCount: 3  Partition: ALL
>>>>> >>> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
>>>>> >>> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
>>>>> >>> Exec:  ''  ExecSize: 0  ImageSize: 0
>>>>> >>> Dedicated Resources Per Task: PROCS: 1
>>>>> >>> NodeAccess: SINGLEJOB
>>>>> >>> NodeCount: 3
>>>>> >>>
>>>>> >>>
>>>>> >>> IWD: [NONE]  Executable:  [NONE]
>>>>> >>> Bypass: 5  StartCount: 0
>>>>> >>> PartitionMask: [ALL]
>>>>> >>> Flags:       RESTARTABLE
>>>>> >>>
>>>>> >>> Reservation '5791' (00:00:01 -> 00:04:52  Duration: 00:04:51)
>>>>> >>> PE:  5.00  StartPriority:  48
>>>>> >>> cannot select job 5791 for partition DEFAULT (startdate in '00:00:01')
>>>>> >>>
>>>>> >>> ------------
>>>>> >>>
>>>>> >>> What could be the reason this job does not start? How do I resolve
>>>>> >>> this?
>>>>> >>>
>>>>> >>> Thanks,
>>>>> >>> Kunal
>>>>> >>>