[torquedev] nodes, procs, tpn and ncpus

Glen Beane glen.beane at gmail.com
Wed Jun 9 10:06:12 MDT 2010


On Wed, Jun 9, 2010 at 11:52 AM, Ken Nielson
<knielson at adaptivecomputing.com> wrote:
> On 06/09/2010 09:09 AM, "Mgr. Šimon Tóth" wrote:
>>
>> On 9.6.2010 16:52, Ken Nielson wrote:
>>
>>>
>>> On 06/09/2010 08:43 AM, "Mgr. Šimon Tóth" wrote:
>>>
>>>>
>>>> On 9.6.2010 16:40, Ken Nielson wrote:
>>>>>
>>>>> On 06/09/2010 07:45 AM, Glen Beane wrote:
>>>>>>>
>>>>>>> I am going to modify TORQUE so it will process these resources more
>>>>>>> like we expect.
>>>>>>>>
>>>>>>>>   procs=x will mean give me x processors anywhere.
>>>>>>
>>>>>> great
>>>>>>>>
>>>>>>>>   nodes=x will mean the same as procs=x.
>>>>>>
>>>>>> I don't think this should be the case... Moab reinterprets it to mean
>>>>>> the same thing, but historically with PBS that is not how it has been
>>>>>> interpreted.
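>>>>>>
>>>>>> To illustrate the difference (the numbers and node sizes here are just
>>>>>> made up for the example), the historical reading is roughly:
>>>>>>
>>>>>>    qsub -l nodes=2          # 2 nodes, default of 1 processor on each
>>>>>>    qsub -l nodes=2:ppn=4    # 2 nodes, 4 processors on each
>>>>>>    qsub -l procs=8          # (proposed) 8 processors, placed anywhere
>>>>>>
>>>>>> so treating nodes=x as identical to procs=x would change what existing
>>>>>> submit scripts get.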
>>>>>>>>
>>>>>>>>   nodes=x:ppn=x will work as it currently does except that the
>>>>>>>> value for nodes will not be ignored.
>>>>>>
>>>>>> What do you mean the value for nodes will not be ignored???  The value
>>>>>> for nodes is NOT ignored now.
>>>>>>
>>>>>>
>>>>>> gbeane at wulfgar:~>    echo "sleep 60" | qsub -l
>>>>>> nodes=2:ppn=4,walltime=00:01:00
>>>>>> 69792.wulfgar.jax.org
>>>>>> gbeane at wulfgar:~>    qrun 69792
>>>>>> gbeane at wulfgar:~>    qstat -f 69792
>>>>>> ...
>>>>>>       exec_host =
>>>>>> cs-prod-2/3+cs-prod-2/2+cs-prod-2/1+cs-prod-2/0+cs-prod-1/3+cs
>>>>>>     -prod-1/2+cs-prod-1/1+cs-prod-1/0
>>>>>> ...
>>>>>>       Resource_List.neednodes = 2:ppn=4
>>>>>>       Resource_List.nodect = 2
>>>>>>       Resource_List.nodes = 2:ppn=4
>>>>>
>>>>> It seems you and Simon agree about how TORQUE is working. Following is
>>>>> what I have in qmgr.
>>>>>
>>>>> #
>>>>> # Create queues and set their attributes.
>>>>> #
>>>>> #
>>>>> # Create and define queue batch
>>>>> #
>>>>> create queue batch
>>>>> set queue batch queue_type = Execution
>>>>> set queue batch resources_default.nodes = 1
>>>>> set queue batch resources_default.walltime = 01:00:00
>>>>> set queue batch enabled = True
>>>>> set queue batch started = True
>>>>> #
>>>>> # Set server attributes.
>>>>> #
>>>>> set server scheduling = True
>>>>> set server acl_host_enable = True
>>>>> set server acl_hosts = l18
>>>>> set server acl_hosts += L18
>>>>> set server acl_hosts += kmn
>>>>> set server managers = ken at kmn
>>>>> set server operators = ken at kmn
>>>>> set server default_queue = batch
>>>>> set server log_events = 511
>>>>> set server mail_from = adm
>>>>> set server resources_available.nodect = 1024
>>>>> set server scheduler_iteration = 600
>>>>> set server node_check_rate = 150
>>>>> set server tcp_timeout = 6
>>>>> set server log_level = 6
>>>>> set server mom_job_sync = True
>>>>> set server keep_completed = 30
>>>>> set server next_job_number = 100
>>>>>
>>>>> Whenever I do -l nodes=x:ppn=y where x is greater than 1 I still only
>>>>> get one node allocated to the job.
>>>>
>>>> Well, what scheduler are you using? Schedulers can completely mask the
>>>> original nodespec. They can send their own nodespec in the run request.
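>>>>
>>>> For example (if I remember the syntax correctly -- check the qrun man
>>>> page for the exact host-list format), a scheduler or an admin can pass
>>>> an explicit host list to qrun and bypass whatever the job requested:
>>>>
>>>>    qrun -H node01+node02 <jobid>
>>>>
>>>> where node01 and node02 are just placeholder host names.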
>>>
>>> I am not using any scheduler. I run my jobs by hand. A scheduler would
>>> supersede any TORQUE interpretation.
>>>
>>
>> Well, then it's just weird. Can you post the server log?
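>>
>> It might also help to see what the server thinks it has to work with,
>> e.g. the np = value reported for each node by:
>>
>>    pbsnodes -a
>>
>> since that is what the nodespec ends up being matched against.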
>
> Simon,
>
> Attached are two server logs.
>
> node2ppn2 comes from the command "qsub -l nodes=2:ppn=2 psaux".  The
> resulting job id for this is 101.kmn.
>
> nodes3 comes from the command "qsub -l nodes=3 psaux".  The job id to look
> for in this is 102.kmn.
>
> Following is the qstat output for node2ppn2
> Job Id: 101.kmn
>    Job_Name = psaux
>    Job_Owner = ken at kmn
>    resources_used.cput = 00:00:00
>    resources_used.mem = 0kb
>    resources_used.vmem = 0kb
>    resources_used.walltime = 00:00:00
>    job_state = C
>    queue = batch
>    server = kmn
>    Checkpoint = u
>    ctime = Wed Jun  9 09:24:12 2010
>    Error_Path = kmn:/home/ken/psaux.e101
>    exec_host = kmn/1+kmn/0
>    Hold_Types = n
>    Join_Path = n
>    Keep_Files = n
>    Mail_Points = a
>    mtime = Wed Jun  9 09:24:16 2010
>    Output_Path = kmn:/home/ken/psaux.o101
>    Priority = 0
>    qtime = Wed Jun  9 09:24:12 2010
>    Rerunable = True
>    Resource_List.neednodes = 2:ppn=2
>    Resource_List.nodect = 2
>    Resource_List.nodes = 2:ppn=2
>    Resource_List.walltime = 00:01:00
>    session_id = 2800
>    substate = 59
>    Variable_List = PBS_O_HOME=/home/ken,PBS_O_LANG=en_US.utf8,
>    PBS_O_LOGNAME=ken,
>    PBS_O_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/b
>    in:/usr/games:/usr/local/lib:usr/local/bin:/usr/local/lib/:/opt/slicke
>    dit/bin:/opt/moab/sbin/:/opt/moab/bin/,PBS_O_SHELL=/bin/bash,
>    PBS_O_HOST=kmn,PBS_SERVER=kmn,PBS_O_WORKDIR=/home/ken,
>    PBS_O_QUEUE=batch
>    euser = ken
>    egroup = ken
>    hashname = 101.kmn
>    queue_rank = 1
>    queue_type = E
>    etime = Wed Jun  9 09:24:12 2010
>    exit_status = 0
>    submit_args = -l nodes=2:ppn=2 psaux
>    start_time = Wed Jun  9 09:24:16 2010
>    Walltime.Remaining = 5
>    start_count = 1
>    fault_tolerant = False
>    comp_time = Wed Jun  9 09:24:16 2010
>
> Following is the qstat output for nodes3
> Job Id: 102.kmn
>    Job_Name = psaux
>    Job_Owner = ken at kmn
>    resources_used.cput = 00:00:00
>    resources_used.mem = 0kb
>    resources_used.vmem = 0kb
>    resources_used.walltime = 00:00:00
>    job_state = C
>    queue = batch
>    server = kmn
>    Checkpoint = u
>    ctime = Wed Jun  9 09:45:11 2010
>    Error_Path = kmn:/home/ken/psaux.e102
>    exec_host = kmn/0
>    Hold_Types = n
>    Join_Path = n
>    Keep_Files = n
>    Mail_Points = a
>    mtime = Wed Jun  9 09:45:16 2010
>    Output_Path = kmn:/home/ken/psaux.o102
>    Priority = 0
>    qtime = Wed Jun  9 09:45:11 2010
>    Rerunable = True
>    Resource_List.neednodes = 3
>    Resource_List.nodect = 3
>    Resource_List.nodes = 3
>    Resource_List.walltime = 00:01:00
>    session_id = 3000
>    substate = 59
>    Variable_List = PBS_O_HOME=/home/ken,PBS_O_LANG=en_US.utf8,
>    PBS_O_LOGNAME=ken,
>    PBS_O_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/b
>    in:/usr/games:/usr/local/lib:usr/local/bin:/usr/local/lib/:/opt/slicke
>    dit/bin:/opt/moab/sbin/:/opt/moab/bin/,PBS_O_SHELL=/bin/bash,
>    PBS_O_HOST=kmn,PBS_SERVER=kmn,PBS_O_WORKDIR=/home/ken,
>    PBS_O_QUEUE=batch
>    euser = ken
>    egroup = ken
>    hashname = 102.kmn
>    queue_rank = 1
>    queue_type = E
>    etime = Wed Jun  9 09:45:11 2010
>    exit_status = 0
>    submit_args = -l nodes=3 psaux
>    start_time = Wed Jun  9 09:45:16 2010
>    Walltime.Remaining = 5
>    start_count = 1
>    fault_tolerant = False
>    comp_time = Wed Jun  9 09:45:16 2010
>
> Thanks for taking a look.
>


What version of TORQUE? Seems like a new bug.
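(qstat --version on a client, or the pbs_version attribute shown by
"qmgr -c 'list server'", should tell us the exact build)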

