[torqueusers] PBS nodes state consistently going 'down'

Ken Nielson knielson at adaptivecomputing.com
Mon Aug 1 08:53:24 MDT 2011


Benjamin,

I am sorry to just keep asking more questions. Could you tell me how you configured TORQUE? The config.log file will show all of the options.
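
If it helps, an autoconf-style config.log records the configure command line near the top, so something like this, run in the TORQUE build directory (assuming config.log is still there), should show it:

  grep '\$ ./configure' config.log

Otherwise, just paste the ./configure line you originally ran.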

Regards

Ken Nielson
Adaptive Computing

----- Original Message -----
> From: "Benjamin Leopold" <BLeopold at uni-muenster.de>
> To: "Ken Nielson" <knielson at adaptivecomputing.com>
> Cc: "Torque Users Mailing List" <torqueusers at supercluster.org>
> Sent: Monday, August 1, 2011 12:39:35 AM
> Subject: Re: [torqueusers] PBS nodes state consistently going 'down'
> Ooops, forgot that detail.
> v2.5.7
> 
> 
> On 2011-07-29 16:11 , Ken Nielson wrote:
> > What version of TORQUE are you using?
> >
> > Ken Nielson
> > Adaptive Computing
> >
> > ----- Original Message -----
> >> From: "B Leopold"<bleopold at uni-muenster.de>
> >> To: "Torque Users Mailing List"<torqueusers at supercluster.org>
> >> Sent: Friday, July 29, 2011 2:50:37 AM
> >> Subject: [torqueusers] PBS nodes state consistently going 'down'
> >> Good Morning All,
> >> (I thank you all for all your previous help!)
> >>
> >>
> >>
> >> Currently, the state of several Torque PBS nodes is consistently
> >> going 'down'.
> >>
> >> For reference:
> >> - Of 5 nodes, only one of them usually has state=free
> >>
> >> - I can reset the state of each node to 'free' using a command
> >> like:
> >> qmgr -a -c 'set node torquenode02 state=free'
> >> But they only stay that way until they stop being used, and then
> >> revert to 'down' again.
> >>
> >> - All of these nodes are on the same server (a pre-cluster setup,
> >> using multiple names for the same box in the /etc/hosts file), so
> >> it's not the firewall port issue; each node pinging another is
> >> really pinging itself.
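> >> (Roughly, that aliasing would look like the following illustrative
> >> /etc/hosts line, not copied from the actual file:
> >> 128.176.45.182  ngs-anal torquenode01 torquenode02 torquenode03 torquenode04)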
> >>
> >> - Only the first two (of 5) nodes in the list from 'pbsnodes' show
> >> the full set of node os/var/misc settings in the 'status' line. All
> >> the others have no status line at all.
> >>
> >> - pbs_mom is running, on the same box as pbs_server.
> >>
> >> - the mom_priv/config (with my comments for my own help :) file is:
> >> $pbsserver ngs-anal # hostname running pbs_server
> >> #$pbsserver 128.176.45.182 # host running pbs_server
> >> $pbsclient torquenode* # machines which the mom daemon will trust to run resource manager commands
> >> $restricted ngs-anal,torquenode* # Specifies hosts which can be trusted to access mom services as non-root.
> >> $logevent 255 # a bitmap for event types to log
> >> $loglevel 7 # verbosity of log (0 lowest, 7 highest)
> >> $usecp *:/ / # Specifies which directories should be staged
> >> $rcpcmd /usr/bin/scp -rvpB # specifies which command to use for copying files
> >> $job_output_file_umask 012 # umask of files created by job
> >> $cputmult 1.0 # cpu time multiplier
> >> $wallmult 1.0 # The factor is used for walltime calculations and limits in the same way that cputmult is used for cpu time.
> >> $max_load 1.0 # maximum processor load
> >> $ideal_load 1.0 # ideal processor load
> >> As shown above, I did attempt using the ip address as the pbsserver
> >> var instead of the hostname, but got the same results.
> >>
> >> The server_priv/nodes file:
> >> ngs-anal np=2
> >> torquenode01 np=2
> >> torquenode02 np=2
> >> torquenode03 np=2
> >> torquenode04 np=2
> >>
> >>
> >> The pbsnodes command results:
> >>> pbsnodes
> >> ngs-anal
> >> state = down
> >> np = 2
> >> ntype = cluster
> >> status =
> >> rectime=1311863128,varattr=,jobs=,state=free,netload=1494089052678,gres=sed:for
> >> cpu time.,
> >> loadave=0.05,ncpus=16,physmem=99204540kb,availmem=1239385736kb,totmem=1292469320kb,idletime=233,
> >> nusers=6,nsessions=44,sessions=1433 21455 6292 6321 6322 6323 6324
> >> 6355 6939 7323 7766 7916 12018
> >> 12367 12460 12477 12481 12498 12509 12524 12539 12572 12768 25663
> >> 28195 13594 14762 19902 19980
> >> 19991 20149 21838 25672 25740 29174 29995 30000 30013 30017 30042
> >> 30118 30127 30167 32701,
> >> uname=Linux NGS-Anal 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC
> >> 2011 x86_64,opsys=linux
> >> gpus = 0
> >>
> >> torquenode01
> >> state = free
> >> np = 2
> >> ntype = cluster
> >> status =
> >> rectime=1311863307,varattr=,jobs=,state=free,netload=1494089457297,gres=sed:for
> >> cpu time.,
> >> loadave=0.48,ncpus=16,physmem=99204540kb,availmem=1239385616kb,totmem=1292469320kb,idletime=416,
> >> nusers=6,nsessions=44,sessions=1433 21455 6292 6321 6322 6323 6324
> >> 6355 6939 7323 7766 7916 12018
> >> 12367 12460 12477 12481 12498 12509 12524 12539 12572 12768 25663
> >> 28195 13594 14762 19902 19980
> >> 19991 20149 21838 25672 25740 29174 29995 30000 30013 30017 30042
> >> 30118 30127 30167 32701,
> >> uname=Linux NGS-Anal 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC
> >> 2011 x86_64,opsys=linux
> >> gpus = 0
> >>
> >> torquenode02
> >> state = down
> >> np = 2
> >> ntype = cluster
> >> gpus = 0
> >>
> >> torquenode03
> >> state = down
> >> (etc...)
> >>
> >>
> >> Thanks for any further pointers on this!
> >> -Benjamin-
> >>
> >> _______________________________________________
> >> torqueusers mailing list
> >> torqueusers at supercluster.org
> >> http://www.supercluster.org/mailman/listinfo/torqueusers
> 
> --
> --------------------------------
> Benjamin Leopold
> BioInformatiker
> benjamin.leopold at uni-muenster.de
> Poliklinik für Parodontologie
> Universitätsklinik Münster
> Waldeyerstraße 30, 48149 Münster
> T +49(0)251-83-49882
> --------------------------------

