[torqueusers] multiple jobs: yes, no, maybe

Rushton Martin JMRUSHTON at qinetiq.com
Wed Jan 2 04:08:31 MST 2013


Be careful with changing NODEACCESSPOLICY.  We had to set it to UNIQUEUSER (under Moab), which allows only one job per user on a given node.  Our job epilogue cleans up shared memory structures based on the username, not the process.  We found this out the hard way when user jobs stopped dead because the epilogue of another job from the same user cleaned up behind them.  The manual recommends UNIQUEUSER and we haven't had that particular problem since.  I realise this is exactly what Jack's users do NOT want, and the problem won't occur for serial jobs, only when the same user runs multiple parallel jobs on a given node.  Caveat administrator!
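For reference, the policy above is a single line in the Moab configuration.  This is a sketch, not our actual config; the comment reflects the behaviour as described above:

```
# moab.cfg -- minimal sketch, assuming Moab's global NODEACCESSPOLICY parameter
# UNIQUEUSER: jobs may share a node, but each user gets at most
# one job per node, so one job's epilogue cannot kill a sibling
# job from the same user.
NODEACCESSPOLICY  UNIQUEUSER
```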

Regards,

Martin

-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of André Gemünd
Sent: 21 December 2012 07:25
To: Torque Users Mailing List
Subject: Re: [torqueusers] multiple jobs: yes, no, maybe

Hi Jack,

----- Original Message -----
> The user wanted to know if it was possible to have more than one job 
> running on a node at a time? I honestly didn’t have an answer for him.

It is possible. If you use Maui, you can set NODEACCESSPOLICY to SHARED (allow multiple jobs on a node) or SINGLEUSER (allow multiple jobs, but only from the same user). That way you can allow one job per core.
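As a sketch, either policy is a one-line setting in maui.cfg (pick one; SHARED shown active here):

```
# maui.cfg -- minimal sketch
NODEACCESSPOLICY  SHARED       # any jobs may share a node, up to its resources
# NODEACCESSPOLICY  SINGLEUSER # only jobs from the same user may share a node
```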
The problem is usually misbehaving software: a job can request only one core but still use the whole node. For example, codes that use OpenMP will typically spawn one thread per core and use the whole machine unless told otherwise.

The first step against that is to set the node availability computation to COMBINED (respecting utilized resources as well as merely "reserved" resources): http://www.adaptivecomputing.com/resources/docs/maui/5.4nodeavailability.php
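In maui.cfg that policy looks like this (a minimal sketch):

```
# maui.cfg -- a node only counts as available if BOTH its dedicated
# (reserved) and its utilized (actually consumed) resources permit it
NODEAVAILABILITYPOLICY  COMBINED
```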
Then you can restrict the resources actually available to a job using cpusets (http://www.adaptivecomputing.com/resources/docs/torque/2-5-12/help.htm#topics/3-nodes/linuxCpusetSupport.htm).
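Note that cpuset support has to be compiled into the pbs_mom.  A sketch of the rebuild, assuming you are building Torque 2.5.x from source on the compute nodes (paths and privileges will vary per site):

```shell
# Build Torque's pbs_mom with Linux cpuset support
# (--enable-cpuset is the configure flag documented for this feature)
./configure --enable-cpuset
make
sudo make install
```

The kernel's cpuset filesystem must also be available on the nodes for this to take effect at runtime.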

> My feeling is that the whole purpose of the cluster is to give the 
> power of a whole node to process a job and not have to share it.

It really depends on the workload. If you have many independent single-core jobs (more of an HTC workload), you'll want to allow one job per core. You could set your queue resource defaults to allocate a whole node but let users specifically request single cores, or even use a dedicated queue for that.
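A sketch of that setup with qmgr; the queue name "batch" and the 8-core node size are assumptions for illustration:

```shell
# Default jobs in the "batch" queue to a whole (assumed 8-core) node
qmgr -c "set queue batch resources_default.nodes = 1:ppn=8"

# A user who wants only a single core can still override the default:
# qsub -l nodes=1:ppn=1 job.sh
```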

Greetings
André

--
André Gemünd
Fraunhofer-Institute for Algorithms and Scientific Computing
andre.gemuend at scai.fraunhofer.de
Tel: +49 2241 14-2193
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers


