[torqueusers] Strange problem in Torque and Maui

Gordon Golding gordon.golding at colorado.edu
Tue Oct 23 16:20:40 MDT 2007


Looks very much like this:

Cannot connect to server: error=15034

This error occurs in TORQUE clients (or their APIs) because TORQUE cannot
find the "server_name" file and/or the PBS_DEFAULT environment variable is
not set. The "server_name" file or PBS_DEFAULT variable indicates the
pbs_server's hostname that the client tools should communicate with. The
"server_name" file is usually located in TORQUE's local state directory.
Make sure the file exists, has proper permissions, and that the version of
TORQUE you are running was built with the proper directory settings.
Alternatively you can set the PBS_DEFAULT environment variable. Restart
TORQUE daemons if you make changes to these settings.
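
As a rough sketch of the lookup described above, the shell function below
reports which server name a client would probably pick up. It assumes that
PBS_DEFAULT is consulted before the "server_name" file, and it takes the
state directory as an argument because the right path depends on how TORQUE
was built (/var/spool/pbs is only the directory mentioned in the original
message). The function name resolve_pbs_server is made up for illustration.

```shell
#!/bin/sh
# Sketch only: resolve_pbs_server is a hypothetical helper, not a TORQUE tool.
# Assumption: PBS_DEFAULT takes precedence over the server_name file.

resolve_pbs_server() {
    # State directory to inspect; the default is the path quoted in this thread.
    state_dir="${1:-/var/spool/pbs}"

    # 1. The environment variable, if set and non-empty.
    if [ -n "${PBS_DEFAULT:-}" ]; then
        printf '%s\n' "$PBS_DEFAULT"
        return 0
    fi

    # 2. The server_name file, which must be readable and non-empty.
    if [ -r "$state_dir/server_name" ] && [ -s "$state_dir/server_name" ]; then
        head -n 1 "$state_dir/server_name"
        return 0
    fi

    echo "no server name: PBS_DEFAULT is unset and $state_dir/server_name is missing, empty, or unreadable" >&2
    return 1
}
```

Calling resolve_pbs_server /var/spool/pbs as the same account that runs
qstat (and as the account that runs maui) shows whether that account can
actually read the file, which is worth checking when the nodes work and the
server does not.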

-----Original Message-----
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Kandy Wong
Sent: Tuesday, October 23, 2007 3:32 PM
To: torqueusers at supercluster.org; mauiusers at supercluster.org
Subject: [torqueusers] Strange problem in Torque and Maui

Hi Everyone,

I'm using torque-2.1.8 with maui-3.2.6p17.
The system worked fine until a few days ago. Now when I submit a job,
it never runs, even though I'm sure all the nodes are available.
I can still force the job to run using 'qrun'.
The strange things are:
When I run 'showq', it shows
0 of    0 Processors Active (0.00%)

I can use the commands 'pbsnodes -a', 'qstat' and 'qmgr' on the nodes,
but not on the server. The following is the output on the server:
pbsnodes -a
No default server name.
pbsnodes: cannot connect to server , error=15034

qstat:
No default server name.
qstat: cannot connect to server (null) (errno=15034)

qmgr
No default server name.
qmgr: cannot connect to server

Also, when I use the 'xpbs' and 'xpbsmon' commands on the nodes, they
show all the correct information, such as the server name and queues.
But when I tried them on the server, I got the complaint 'No 
Permission.\nxpbs_datadump: Can not connect to server (15007)'
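
Two different error codes show up in the output above, and they are easy to
mix up. The sketch below pulls the codes out of client output and pairs each
with the message that appears next to it in this thread. The function names
are hypothetical, only the two codes quoted here are covered, and it assumes
a grep that supports -o (GNU and BSD grep do).

```shell
#!/bin/sh
# Sketch only: these helpers are illustrative and not part of TORQUE.

# Map an error code to the message printed alongside it in this thread.
pbs_errno_hint() {
    case "$1" in
        15034) echo "No default server name" ;;
        15007) echo "No permission" ;;
        *)     echo "not covered by this sketch" ;;
    esac
}

# Read client or log output on stdin, extract codes written as
# "error=N", "errno=N" or "errno: N", and print one hint per distinct code.
explain_pbs_errors() {
    grep -oE 'err(or|no)[=:] ?[0-9]+' |
        sed 's/[^0-9]//g' |
        sort -u |
        while read -r code; do
            printf '%s: %s\n' "$code" "$(pbs_errno_hint "$code")"
        done
}
```

For example, piping the output of 'qstat 2>&1' into explain_pbs_errors on
the server would be expected to report 15034, which points at the server
name lookup and not at a permissions problem.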

So I looked at /var/spool/pbs/server_logs:
PBS_Server;Svr;WARNING;ALERT: unable to contact node server name
PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, 
Could not contact Scheduler - port 15004
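
The "Connection refused (111)" entry means nothing accepted a TCP connection
on port 15004 when pbs_server tried. A minimal way to repeat that test by
hand is sketched below. It relies on bash's /dev/tcp redirection, which is a
bash feature (not plain sh) and can be disabled when bash is built. The
function name port_open is made up. Whether Maui is expected to listen on
that port depends on how it is configured, so treat the result as one data
point and not as proof that the scheduler is down.

```shell
#!/bin/bash
# Sketch only: port_open is a hypothetical helper.
# Returns 0 if a TCP connection to host:port succeeds, non-zero otherwise.
port_open() {
    local host="$1" port="$2"
    # The subshell opens the connection on fd 3 and exits at once, which
    # closes it again; the subshell's exit status is the function's result.
    (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}
```

On the server, "port_open localhost 15004 && echo listening" repeats the
check that pbs_server logged as refused.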

And in the log /var/spool/maui/
ERROR:    cannot connect to PBS server 'server name'  rc: -1 (errno: 15007)
ALERT:    cannot re-initialize PBS interface
10/23 11:12:02 ALERT:    cannot load cluster resources on RM (RM '0' 
failed in function 'clusterquery')
10/23 11:12:02 WARNING:  no resources detected

The only thing I can think of is that the server was reset a week ago,
but I'm sure the pbs_server, pbs_mom and maui services are all running
again. The 'server_name' file is in /var/spool/pbs.
Any ideas?
Thank you very much.

Kandy

