[torquedev] Re: [torqueusers] Torque notes from SC'08

Michel Béland michel.beland at rqchp.qc.ca
Tue Jan 6 21:55:05 MST 2009


Chris Samuel wrote:

>Apologies for taking so long to reply,
>

Same for me.

>>>>The original memory and cpu requirements should always be kept
>>>>in case the job needs to be restarted.
>>>The nodes and mem will be, do you mean the vnode and NUMA node
>>>allocations too ?
>>No, not the node allocations. If one asks for 4 cpus and 20 GB
>>on our Altix 4700, the job will get three nodes (12 cpus and a
>>little less than 23 GB, because some memory has to be given to
>>the operating system).
>
>Now at that point you're talking NUMA nodes and not compute
>nodes, yes ?

Indeed, I am talking about NUMA nodes. Sorry if I was not clear. On our 
Altix 4700 each NUMA node has 4 cpus and a bit less than 8 GB left for 
jobs, so a 20 GB request needs three nodes, and the three nodes bring 
12 cpus with them.

>Does the job get to access all 12 of those CPUs or are they
>just marked as inaccessible to other jobs ?

Since we boldly started moving from PBS Pro to Torque yesterday, I 
cannot check what PBS Pro did and I do not remember for sure, but I 
think that all the cpus in the allocated nodes are in the cpuset of the 
job. So in theory the processes could move from one processor to 
another to be closer to memory, but we typically use dplace, a tool 
that lets us pin processes to particular cpus.
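
To give an idea, here are the kinds of command lines involved (they are 
only illustrative, not what we actually run):

  # pin a program (and its children) to the first four cpus of the
  # job's cpuset
  dplace -c 0-3 ./a.out

  # with SGI MPT, skip the mpirun shepherd process before placing
  # the MPI processes
  mpirun -np 4 dplace -s1 ./a.out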

>>One way to achieve this on Torque might be to increase the cpu
>>and memory requirements to request complete nodes.
>
>But if you had (say) an MPI job partitioned for N cores
>and you were using a launcher that was TM aware then the
>user might not be happy to find that his careful work has
>been thwarted.
>
>That might just be a user education issue though ("well
>scale it up to X cores instead then").
>
Even with mpiexec, you can still use the -n option to start fewer 
processes than the number of cpus allocated to the job by Torque. On 
the Altix, our MPI library is SGI MPT, which is not TM-aware.
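
For instance, if the memory request has pulled in three NUMA nodes (12 
cpus) but the code was decomposed for 4 processes, something like this 
(illustrative command line) still does what the user wants:

  # start only 4 MPI processes even though 12 cpus are allocated
  mpiexec -n 4 ./a.out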

>[...]
>>The /sys/devices/system/node/node*/meminfo files show small
>>variations. If it is done by the scheduler when the job is
>>scheduled to run, it can fill the selected nodes all right,
>>but if the job is restarted for some reason, it might run on
>>nodes with slightly less memory, forcing the scheduler to
>>request another node for the job while it is not really needed.
>
>At the moment with Torque that might be academic as the BLCR
>checkpoint restart work scheduled to be in 2.4 doesn't support
>parallel jobs (as you need support in the MPI stacks for instance).
>
>But in the general case yes, I can see that happening and I
>can't really see a way around it; the system is very unlikely
>to be close to the state it was in when the job was suspended.
>
I was not talking about checkpoint-restart, but just about requeueing. 
Maybe I should have written "rerunable" instead of "restartable". The 
point is the same, anyway.

>>We have seen this happen with an old version of PBS Pro on
>>our Altix machines.
>
>It might be a necessary price to pay for C/R or S/R
>with jobs on these large systems. :-(

It seems that we do not really have to pay this price. As I wrote 
above, we started the migration from PBS Pro to Torque yesterday, and 
here is what we ended up doing:

- modify src/resmom/linux/cpuset.c so that the cpuset gets the same 
nodes for mem as for cpus (the original code uses all the nodes for 
mem),
- write a qsub wrapper script calling /usr/torque/bin/qsub to make sure 
that jobs take complete nodes for cpus and memory (this sounds simple 
put like that, but the script is getting quite involved; a rough sketch 
of the idea follows below).
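
With the cpuset.c change, a job that ends up with three NUMA nodes 
should get something like cpus = 0-11 and mems = 0-2 in its cpuset, 
instead of mems covering the whole machine. As for the wrapper, what 
follows is only a sketch of the idea, not our real script: the per-node 
figures (4 cpus, 7500 MB of usable memory) are assumptions for our 
Altix 4700, and the real script parses the -l resource lists instead of 
taking plain arguments.

  #!/bin/sh
  # Sketch of the wrapper idea: round the request up to complete NUMA
  # nodes, then hand it to the real qsub.
  CPUS_PER_NODE=4      # cpus per NUMA node (assumed)
  MB_PER_NODE=7500     # usable memory per NUMA node, in MB (assumed)

  NCPUS=$1             # requested cpus   (real script parses -l ncpus=...)
  MEM_MB=$2            # requested memory (real script parses -l mem=...)
  shift 2

  # NUMA nodes needed for each resource, rounded up
  NODES_CPU=$(( (NCPUS  + CPUS_PER_NODE - 1) / CPUS_PER_NODE ))
  NODES_MEM=$(( (MEM_MB + MB_PER_NODE  - 1) / MB_PER_NODE  ))

  # take whichever is larger, in complete nodes
  NODES=$NODES_CPU
  [ "$NODES_MEM" -gt "$NODES" ] && NODES=$NODES_MEM

  # resubmit with the request scaled up to complete nodes
  exec /usr/torque/bin/qsub \
      -l ncpus=$(( NODES * CPUS_PER_NODE )) \
      -l mem=$(( NODES * MB_PER_NODE ))mb \
      "$@"

For the example above (4 cpus and 20 GB), this rounds the request up to 
3 nodes, that is 12 cpus and 22500 MB, close to the 12 cpus and a 
little less than 23 GB that PBS Pro was giving us.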

Michel Béland
Réseau québécois de calcul de haute performance

