[torqueusers] PBS nodes state consistently going 'down'

Benjamin Leopold BLeopold at uni-muenster.de
Mon Aug 1 09:44:25 MDT 2011


Questions are always good things!
I've attached the config.log to this message.

The options I set (to change from the defaults) were:
     --with-default-server
     --with-server-home
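
For reference, the call was roughly of the following shape (a sketch only:
the server hostname is taken from the mom config quoted later in this
thread, and the server-home path is just a placeholder, not a value copied
from the attached config.log):

     ./configure --with-default-server=ngs-anal \
                 --with-server-home=/var/spool/torque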

Thanks!
Benjamin


On 2011-08-01 16:53, Ken Nielson wrote:
> Benjamin,
>
> I am sorry to just keep asking more questions. Could you tell me how you configured TORQUE? The config.log file will show all of the options.
>
> Regards
>
> Ken Nielson
> Adaptive Computing
>
> ----- Original Message -----
>> From: "Benjamin Leopold"<BLeopold at uni-muenster.de>
>> To: "Ken Nielson"<knielson at adaptivecomputing.com>
>> Cc: "Torque Users Mailing List"<torqueusers at supercluster.org>
>> Sent: Monday, August 1, 2011 12:39:35 AM
>> Subject: Re: [torqueusers] PBS nodes state consistently going 'down'
>> Oops, forgot that detail.
>> v2.5.7
>>
>>
>> On 2011-07-29 16:11, Ken Nielson wrote:
>>> What version of TORQUE are you using?
>>>
>>> Ken Nielson
>>> Adaptive Computing
>>>
>>> ----- Original Message -----
>>>> From: "B Leopold"<bleopold at uni-muenster.de>
>>>> To: "Torque Users Mailing List"<torqueusers at supercluster.org>
>>>> Sent: Friday, July 29, 2011 2:50:37 AM
>>>> Subject: [torqueusers] PBS nodes state consistently going 'down'
>>>> Good Morning All,
>>>> (Thank you all for your previous help!)
>>>>
>>>>
>>>>
>>>> Currently, the state of several Torque PBS nodes is consistently going
>>>> 'down'.
>>>>
>>>> For reference:
>>>> - Of 5 nodes, only one of them usually has state=free
>>>>
>>>> - I can reset the state of each node to 'free' with a command like:
>>>>       qmgr -a -c 'set node torquenode02 state=free'
>>>>   However, they only stay 'free' until they stop being used, and then
>>>>   revert to 'down' again.
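>>>>   (For reference, a throwaway loop like the following, just a sketch using
>>>>   the node names from my server_priv/nodes file below, resets all of them
>>>>   in one go:
>>>>       for n in ngs-anal torquenode01 torquenode02 torquenode03 torquenode04; do
>>>>           qmgr -a -c "set node $n state=free"
>>>>       done
>>>>   )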
>>>>
>>>> - All of these nodes are on the same server (a pre-cluster setup), using
>>>>   multiple names for the one box in the /etc/hosts file, so this is not a
>>>>   firewall port issue: the nodes pinging each other are effectively
>>>>   pinging themselves.
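>>>>   (The relevant /etc/hosts line is roughly of this shape; reconstructed
>>>>   here from the IP and hostnames quoted elsewhere in this message rather
>>>>   than copied verbatim from the actual file:
>>>>       128.176.45.182   ngs-anal torquenode01 torquenode02 torquenode03 torquenode04
>>>>   )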
>>>>
>>>> - Only the first two (of 5) nodes listed by 'pbsnodes' show the full set
>>>>   of node os/var/misc settings in their 'status' line. All of the others
>>>>   have no status line at all.
>>>>
>>>> - pbs_mom is running (on the same box as pbs_server).
>>>>
>>>> - the mom_priv/config file (with my comments, for my own help :) is:
>>>>       $pbsserver ngs-anal                # hostname running pbs_server
>>>>       #$pbsserver 128.176.45.182         # host running pbs_server
>>>>       $pbsclient torquenode*             # machines which the mom daemon will trust to run resource manager commands
>>>>       $restricted ngs-anal,torquenode*   # Specifies hosts which can be trusted to access mom services as non-root.
>>>>       $logevent 255                      # a bitmap for event types to log
>>>>       $loglevel 7                        # verbosity of log (0 lowest, 7 highest)
>>>>       $usecp *:/ /                       # Specifies which directories should be staged
>>>>       $rcpcmd /usr/bin/scp -rvpB         # specifies which command to use for copying files
>>>>       $job_output_file_umask 012         # umask of files created by job
>>>>       $cputmult 1.0                      # cpu time multiplier
>>>>       $wallmult 1.0                      # The factor is used for walltime calculations and limits in the same way that cputmult is used for cpu time.
>>>>       $max_load 1.0                      # maximum processor load
>>>>       $ideal_load 1.0                    # ideal processor load
>>>> As shown above (commented out), I did attempt using the IP address as the
>>>> $pbsserver value instead of the hostname, but got the same results.
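>>>>   (i.e. for that test the two $pbsserver lines shown above were simply
>>>>   swapped, roughly:
>>>>       #$pbsserver ngs-anal          # hostname running pbs_server
>>>>       $pbsserver 128.176.45.182     # host running pbs_server
>>>>   )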
>>>>
>>>> The server_priv/nodes file:
>>>>       ngs-anal np=2
>>>>       torquenode01 np=2
>>>>       torquenode02 np=2
>>>>       torquenode03 np=2
>>>>       torquenode04 np=2
>>>>
>>>>
>>>> The pbsnodes command results:
>>>>> pbsnodes
>>>> ngs-anal
>>>>     state = down
>>>>     np = 2
>>>>     ntype = cluster
>>>>     status = rectime=1311863128,varattr=,jobs=,state=free,netload=1494089052678,gres=sed:for cpu time.,
>>>>         loadave=0.05,ncpus=16,physmem=99204540kb,availmem=1239385736kb,totmem=1292469320kb,idletime=233,
>>>>         nusers=6,nsessions=44,sessions=1433 21455 6292 6321 6322 6323 6324 6355 6939 7323 7766 7916 12018
>>>>         12367 12460 12477 12481 12498 12509 12524 12539 12572 12768 25663 28195 13594 14762 19902 19980
>>>>         19991 20149 21838 25672 25740 29174 29995 30000 30013 30017 30042 30118 30127 30167 32701,
>>>>         uname=Linux NGS-Anal 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC 2011 x86_64,opsys=linux
>>>>     gpus = 0
>>>>
>>>> torquenode01
>>>>     state = free
>>>>     np = 2
>>>>     ntype = cluster
>>>>     status = rectime=1311863307,varattr=,jobs=,state=free,netload=1494089457297,gres=sed:for cpu time.,
>>>>         loadave=0.48,ncpus=16,physmem=99204540kb,availmem=1239385616kb,totmem=1292469320kb,idletime=416,
>>>>         nusers=6,nsessions=44,sessions=1433 21455 6292 6321 6322 6323 6324 6355 6939 7323 7766 7916 12018
>>>>         12367 12460 12477 12481 12498 12509 12524 12539 12572 12768 25663 28195 13594 14762 19902 19980
>>>>         19991 20149 21838 25672 25740 29174 29995 30000 30013 30017 30042 30118 30127 30167 32701,
>>>>         uname=Linux NGS-Anal 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC 2011 x86_64,opsys=linux
>>>>     gpus = 0
>>>>
>>>> torquenode02
>>>>     state = down
>>>>     np = 2
>>>>     ntype = cluster
>>>>     gpus = 0
>>>>
>>>> torquenode03
>>>>     state = down
>>>>     (etc...)
>>>>
>>>>
>>>> Thanks for any further pointers on this!
>>>> -Benjamin-
>>>>

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: config.log
Url: http://www.supercluster.org/pipermail/torqueusers/attachments/20110801/a7b87670/attachment-0001.pl 

