[torqueusers] torqueusers Digest, Vol 69, Issue 28

alap pandya arrow1533 at gmail.com
Fri Apr 30 03:50:19 MDT 2010


James,

Thanks for the detailed mail.
I am using Torque version 2.4.6 with Maui 3.3.

As you suggested, I set my /var/spool/torque/server_priv/nodes to:

n02 ppn=8
n01 ppn=8

I stopped pbs_server, and while restarting it I get this error:

PBS_Server: LOG_ERROR::pbsd_init(setup_nodes), could not create node "n02",
error = 15002
PBS_Server: LOG_ERROR::PBS_Server, pbsd_init failed


If I instead set

n02 np=8
n01 np=8

pbs_server restarts successfully.
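
For reference, the np values should be visible from pbsnodes once the server
is back up; the expected output below is just my assumption for 8-core nodes:

pbsnodes -a | grep np

which should print something like:

  np = 8
  np = 8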

I also tried the setting below in the Maui configuration, but still *more
than one job is running on each node*.

NODECFG[n01]    MAXJOB=1
NODECFG[n02]    MAXJOB=1
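
For reference, a way to confirm that Maui picked up these NODECFG lines after
a restart should be to dump its active configuration, e.g. (the grep pattern
is only illustrative):

showconfig | grep NODECFG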

Please let me know your suggestions.

With Regards,
Alap





On Fri, Apr 30, 2010 at 4:34 AM, <torqueusers-request at supercluster.org> wrote:

> Send torqueusers mailing list submissions to
>        torqueusers at supercluster.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>        http://www.supercluster.org/mailman/listinfo/torqueusers
> or, via email, send a message with subject or body 'help' to
>        torqueusers-request at supercluster.org
>
> You can reach the person managing the list at
>        torqueusers-owner at supercluster.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of torqueusers digest..."
>
>
> Today's Topics:
>
>   1. Re: Torque configuration for single node -single job (Ken Nielson)
>   2. Re: Torque configuration for single node -single job
>      (Coyle, James J [ITACD])
>   3. Re: Question about the difference between a node where
>      pbs_server is run and a compute node (Garrick Staples)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 29 Apr 2010 14:48:32 -0600
> From: Ken Nielson <knielson at adaptivecomputing.com>
> Subject: Re: [torqueusers] Torque configuration for single node
>        -single job
> To: torqueusers at supercluster.org
> Message-ID: <4BD9F0A0.5030406 at adaptivecomputing.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> On 04/29/2010 10:46 AM, alap pandya wrote:
> >
> > Hi,
> >
> > How can we avoid node sharing by multiple jobs in Torque (i.e. we do
> > not want multiple jobs to run on the same node at the same time)? Please let
> > me know what configuration changes are required and how to make them.
> >
> > With Regards,
> > Alap
> >
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> There are several ways a node can end up running multiple jobs. Probably the
> easiest thing to ask is: what do you have in your
> $TORQUEHOME/server_priv/nodes file?
>
> Ken Nielson
> Adaptive Computing
>
> ------------------------------
>
> Message: 2
> Date: Thu, 29 Apr 2010 16:35:19 -0500
> From: "Coyle, James J [ITACD]" <jjc at iastate.edu>
> Subject: Re: [torqueusers] Torque configuration for single node
>        -single job
> To: Torque Users Mailing List <torqueusers at supercluster.org>
> Message-ID:
>        <
> D1D950C0853848438D74D2EB6EED082A885AD72511 at EXITS711.its.iastate.edu>
> Content-Type: text/plain; charset="us-ascii"
>
> Alap,
>
>   Here are two suggestions for the case of pbs_sched (Maui and Moab may
> have more sophisticated mechanisms). The first is for any user; the second
> can only be implemented by the admin and does not do exactly what you want,
> but it is automatic.
>
>
> 1)      For user:
> -----------------------------------
> I am going to assume that all the nodes are of type cluster (not
> time-shared); you can check this with the command:
> pbsnodes -a | grep ntype
>
> all lines should look like:
>
>   ntype = cluster
>
>
> Assuming that the file /var/spool/torque/server_priv/nodes has lines like:
>
> node001  ppn=4
> node002  ppn=4
>
> where they are all 4, then any user can get a node to himself/herself just
> by reserving the full node, e.g. for the
> above ppn=4 submit with
>
> -lnodes=1:ppn=4
>
> Even if you only use one or two processors.  Wasteful, yes, but it works.
> You have reserved the entire node so no other jobs can run on this node.
>  (You will likely get charged for all 4 processors if charging is done.)
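>
> For example, a complete submission that reserves one whole 4-core node would
> look something like this (the script name and walltime are just placeholders):
>
>   qsub -l nodes=1:ppn=4 -l walltime=01:00:00 myjob.sh
>
> The job may only use one of the four processors, but all four stay allocated
> to it, so nothing else is scheduled onto that node.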
>
> If you do not have access to /var/spool/torque/server_priv/nodes, then
> issue:
> pbsnodes -a | grep np
>
>
> and hopefully you see the same number after np =
> e.g.
> np = 4
>
> as for the case above. (This is not a typo: it is np= in pbsnodes -a and
> ppn= in the nodes file.)
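>
> For reference, each node's entry in the pbsnodes -a output should then look
> roughly like this (most fields omitted):
>
>   node001
>        state = free
>        np = 4
>        ntype = cluster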
>
>
> 2)      For manager
>
> -----------------------
>
>  If you are a manager for the cluster, you can issue
>
> qmgr -c  'set server node_pack = False'
>
>  This will attempt to always start a new job on an empty node, so if there
> are free nodes, the jobs will spread out.
> This will not prevent jobs from sharing a node, but it will delay it.
> I don't use this nor recommend it unless you are running the cluster like a
> farm, that is, all the jobs are single-processor
> jobs and you want to spread the load as much as possible.
>
> If you are trying to run multi-processor jobs, it is best to pack them so
> that there are lots of fully free nodes.
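>
> You can check the current value afterwards with something like the following
> (the exact output wording may differ between Torque versions):
>
>   qmgr -c 'print server' | grep node_pack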
>
>
>  James Coyle, PhD
>  High Performance Computing Group
>  115 Durham Center
>  Iowa State Univ.
>  Ames, Iowa 50011           web: http://www.public.iastate.edu/~jjc
>
>
> From: torqueusers-bounces at supercluster.org [mailto:
> torqueusers-bounces at supercluster.org] On Behalf Of alap pandya
> Sent: Thursday, April 29, 2010 11:46 AM
> To: torqueusers at supercluster.org
> Subject: [torqueusers] Torque configuration for single node -single job
>
>
> Hi,
>
> How can we avoid node sharing by multiple jobs in Torque (i.e. we do not
> want multiple jobs to run on the same node at the same time)? Please let me
> know what configuration changes are required and how to make them.
>
> With Regards,
> Alap
>
> ------------------------------
>
> Message: 3
> Date: Thu, 29 Apr 2010 16:07:27 -0700
> From: Garrick Staples <garrick at usc.edu>
> Subject: Re: [torqueusers] Question about the difference between a
>         node where pbs_server is run and a compute node
> To: Torque Users Mailing List <torqueusers at supercluster.org>
> Message-ID: <20100429230727.GF18981 at polop.usc.edu>
> Content-Type: text/plain; charset="us-ascii"
>
> On Thu, Apr 29, 2010 at 09:26:06AM +0200, Bas van der Vlies alleged:
> >
> > On 28 apr 2010, at 20:37, Garrick Staples wrote:
> >
> > > On Wed, Apr 28, 2010 at 08:05:08PM +0200, Bas van der Vlies alleged:
> > >> Just a question: is there a switch in configure to switch back to the
> > >> old pbs_iff behaviour?
> > >
> > > What old pbs_iff behaviour? The unix domain socket code has been there
> > > since the 2.1.x days.
> > >
> >
> > Garrick, can you explain why our 2.1.11 pbs utilities use the 'pbs_iff'
> > interface to communicate with the pbs_server if they run on the node where
> > the pbs_server is started?  We do not have any problems because a child is
> > created and pbs_server can accept connections again. So in this installation
> > the /tmp/.torque-unix is not used at all, or does it have a different name?
> >
>
> I can't say that I know what is going on over there.
>
>
> > When we run the same utilities on a 2.4.7 installation, the
> > /tmp/.torque-unix is used and no child is created.  The problem might be that
> > the server only handles one connection when /tmp/.torque-unix is used. So
> > when I do a pbs_connect() and let it linger, it will eventually time out, but
> > the pbs_server does not accept connections anymore until the timeout.
> >
> > That is why I asked if we can use the pbs_iff interface on the pbs_server
> > again!!!
>
> ./configure --disable-unixsockets
>
> Note that, when I wrote it, the unix socket support was a huge performance
> boost and didn't suck up lots of privileged ports. But I can't comment on
> what happened to it in the 2.4.x branch.
>
>
> > To trigger it is easy. Just use pbs_connect() and do not close it. We
> > have tested it on:
> >   - debian lenny
> >   - centos 5
>
> wait... I thought you were having a problem with the basic stuff like
> qstat? Those always immediately exit.
>
> I may have been misunderstanding the problem all along.
>
>
> > -------------------------------
> > I found the problem on the pbs_server:
> >   - /var/spool/torque/server_name
> >
> > If this contains a name that is in /etc/hosts, it uses the
> > /tmp/.torque-unix mechanism that causes the problem. If it is defined with
> > a name that must be 'resolved' by something other than /etc/hosts, it will
> > use the pbs_iff interface; this has no problem because a child process is
> > created.
> >
> > So the temporary solution is to use a name that must be resolved by DNS.
>
> No, it has nothing to do with DNS. Torque has no idea how a name is found.
> The lower-level system libs do that.
>
> If you look at the client lib code, there is a comparison after the name
> lookup against localhost and the server name.
>
> src/lib/Libifl/pbsD_connect.c:
> #ifdef ENABLE_UNIX_SOCKETS
>  /* determine if we want to use unix domain socket */
>
>  if (!strcmp(server, "localhost"))
>    use_unixsock = 1;
>  else if ((gethostname(hnamebuf, sizeof(hnamebuf) - 1) == 0) &&
>           !strcmp(hnamebuf, server))
>    use_unixsock = 1;
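>    /* note: the unix-domain socket is chosen purely by this string
>     * comparison against "localhost" and the local hostname; no name
>     * resolution is involved in the decision itself. */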
>
>
>
> > The question is: can the unix domain socket handle more than one
> > connection?
>
> It certainly should. It is just a different transport layer. This is the
> first time I've heard a complaint.
>
>
> --
> Garrick Staples, GNU/Linux HPCC SysAdmin
> University of Southern California
>
> Life is Good!
>
> ------------------------------
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
> End of torqueusers Digest, Vol 69, Issue 28
> *******************************************
>



-- 
With Regards,
Alap Pandya

