[torqueusers] PBS nodes state consistently going 'down'

Benjamin Leopold BLeopold at uni-muenster.de
Mon Aug 1 00:39:35 MDT 2011


Oops, forgot that detail.
     v2.5.7


On 2011-07-29 16:11 , Ken Nielson wrote:
> What version of TORQUE are you using?
>
> Ken Nielson
> Adaptive Computing
>
> ----- Original Message -----
>> From: "B Leopold"<bleopold at uni-muenster.de>
>> To: "Torque Users Mailing List"<torqueusers at supercluster.org>
>> Sent: Friday, July 29, 2011 2:50:37 AM
>> Subject: [torqueusers] PBS nodes state consistently going 'down'
>> Good Morning All,
>> (I thank you all for all your previous help!)
>>
>>
>>
>> Currently, the state of several Torque PBS nodes consistently goes
>> 'down'.
>>
>> For reference:
>> - Of 5 nodes, only one of them usually has state=free
>>
>> - I can reset the state of each node to 'free' with a command like:
>> qmgr -a -c 'set node torquenode02 state=free'
>> but they only stay that way until they stop being used, then revert
>> to 'down' again.
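>>
>> As a stopgap I reset them all at once with a loop over the same
>> command (just a convenience sketch, using the node names from my
>> nodes file below):
>> for n in ngs-anal torquenode01 torquenode02 torquenode03 torquenode04; do
>>     qmgr -a -c "set node $n state=free"
>> done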
>>
>> - All of these nodes are on the same physical server (a pre-cluster
>> setup using multiple names for the one box in /etc/hosts), so this is
>> not a firewall port issue: the nodes pinging each other are really
>> pinging themselves.
>>
>> - Only the first two (of 5) nodes listed by 'pbsnodes' show the full
>> set of os/var/misc settings in their 'status' line. The others have
>> no status line at all.
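>>
>> To check whether the moms behind the status-less entries respond at
>> all, I can query them directly with momctl (the -d level only sets
>> the diagnostic verbosity):
>> momctl -d 3 -h torquenode02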
>>
>> - pbs_mom is running on the same box as pbs_server.
>>
>> - The mom_priv/config file (with comments I added for my own
>> reference :) is:
>> $pbsserver ngs-anal  # hostname running pbs_server
>> #$pbsserver 128.176.45.182  # host running pbs_server
>> $pbsclient torquenode*  # machines which the mom daemon will trust to run resource manager commands
>> $restricted ngs-anal,torquenode*  # hosts which can be trusted to access mom services as non-root
>> $logevent 255  # a bitmap of event types to log
>> $loglevel 7  # verbosity of log (0 lowest, 7 highest)
>> $usecp *:/ /  # specifies which directories should be staged
>> $rcpcmd /usr/bin/scp -rvpB  # command to use for copying files
>> $job_output_file_umask 012  # umask of files created by jobs
>> $cputmult 1.0  # cpu time multiplier
>> $wallmult 1.0  # same factor, applied to walltime calculations and limits as cputmult is used for cpu time
>> $max_load 1.0  # maximum processor load
>> $ideal_load 1.0  # ideal processor load
>> As shown in the commented-out line above, I also tried using the IP
>> address for the pbsserver variable instead of the hostname, with the
>> same results.
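>>
>> One thing I notice: the "gres=sed:for cpu time." value in the
>> pbsnodes status below looks like the tail of my $wallmult comment
>> ("...used for cpu time."). If that comment line wraps in the actual
>> file, pbs_mom may be reading the orphaned second line as a generic
>> resource. A stripped-down config I plan to test with, one directive
>> per line and comments kept short:
>> # host running pbs_server
>> $pbsserver ngs-anal
>> # verbose logging while debugging
>> $logevent 255
>> $loglevel 7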
>>
>> The server_priv/nodes file:
>> ngs-anal np=2
>> torquenode01 np=2
>> torquenode02 np=2
>> torquenode03 np=2
>> torquenode04 np=2
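>>
>> The server's view of any one entry can also be checked with qmgr,
>> e.g.:
>> qmgr -c 'list node torquenode02'
>> which prints the state and status attributes the server has recorded
>> for that node.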
>>
>>
>> The pbsnodes command results:
>>> pbsnodes
>> ngs-anal
>> state = down
>> np = 2
>> ntype = cluster
>> status = rectime=1311863128,varattr=,jobs=,state=free,
>> netload=1494089052678,gres=sed:for cpu time.,loadave=0.05,ncpus=16,
>> physmem=99204540kb,availmem=1239385736kb,totmem=1292469320kb,
>> idletime=233,nusers=6,nsessions=44,sessions=1433 21455 6292 6321
>> 6322 6323 6324 6355 6939 7323 7766 7916 12018 12367 12460 12477
>> 12481 12498 12509 12524 12539 12572 12768 25663 28195 13594 14762
>> 19902 19980 19991 20149 21838 25672 25740 29174 29995 30000 30013
>> 30017 30042 30118 30127 30167 32701,
>> uname=Linux NGS-Anal 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC
>> 2011 x86_64,opsys=linux
>> gpus = 0
>>
>> torquenode01
>> state = free
>> np = 2
>> ntype = cluster
>> status = rectime=1311863307,varattr=,jobs=,state=free,
>> netload=1494089457297,gres=sed:for cpu time.,loadave=0.48,ncpus=16,
>> physmem=99204540kb,availmem=1239385616kb,totmem=1292469320kb,
>> idletime=416,nusers=6,nsessions=44,sessions=1433 21455 6292 6321
>> 6322 6323 6324 6355 6939 7323 7766 7916 12018 12367 12460 12477
>> 12481 12498 12509 12524 12539 12572 12768 25663 28195 13594 14762
>> 19902 19980 19991 20149 21838 25672 25740 29174 29995 30000 30013
>> 30017 30042 30118 30127 30167 32701,
>> uname=Linux NGS-Anal 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC
>> 2011 x86_64,opsys=linux
>> gpus = 0
>>
>> torquenode02
>> state = down
>> np = 2
>> ntype = cluster
>> gpus = 0
>>
>> torquenode03
>> state = down
>> (etc...)
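>>
>> For reference, the logs I am watching for clues (assuming the
>> default /var/spool/torque spool directory; adjust for your build):
>> # server side: when and why the server marks a node down
>> grep -i down /var/spool/torque/server_logs/20110729
>> # mom side: whether the mom is actually sending status updates
>> tail -f /var/spool/torque/mom_logs/20110729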
>>
>>
>> Thanks for any further pointers on this!
>> -Benjamin-
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers

-- 
--------------------------------
Benjamin Leopold
Bioinformatician
benjamin.leopold at uni-muenster.de
Poliklinik für Parodontologie
Universitätsklinik Münster
Waldeyerstraße 30, 48149 Münster
T +49(0)251-83-49882
--------------------------------
