[torqueusers] Strange problem in Torque and Maui

James J Coyle jjc at iastate.edu
Tue Oct 23 15:47:11 MDT 2007


Kandy,

   Have you logged into the server and checked that
hostname
and
cat /var/spool/torque/server_name
(or cat /var/spool/pbs/server_name if you are using pbs rather than torque)

both return the same thing?

From your description it looks like the logs are telling you that
the contents of /var/spool/pbs/server_name
are just the string 'server name',
not the name or the IP address.

  I don't know how this got changed; you might check your
initialization scripts, e.g. in /etc/rc3.d.
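
  The comparison above can be sketched as a small shell function. This is
only a sketch: the name check_server_name is made up for illustration and is
not part of torque, and the exact-match test assumes the file holds the same
form of the name (short or fully qualified) that hostname prints.

```shell
#!/bin/sh
# Hypothetical helper (not part of torque): compare the first line of a
# server_name file with a host name.
# Usage: check_server_name FILE [EXPECTED_HOST]
# EXPECTED_HOST defaults to the output of `hostname`.
# Returns 0 on a match, 1 on a mismatch, 2 if FILE cannot be read.
check_server_name() {
    file=$1
    expected=${2:-$(hostname)}

    if [ ! -r "$file" ]; then
        echo "missing: cannot read $file"
        return 2
    fi

    # Take the first line and drop trailing whitespace only, so that a
    # bogus value such as 'server name' is reported exactly as stored.
    actual=$(head -n 1 "$file" | sed 's/[[:space:]]*$//')

    if [ "$actual" = "$expected" ]; then
        echo "ok: $file contains '$actual'"
        return 0
    fi

    echo "mismatch: $file contains '$actual', hostname is '$expected'"
    return 1
}
```

On the server you would run something like
check_server_name /var/spool/torque/server_name
(or pass /var/spool/pbs/server_name if that is where your install keeps it).
A "mismatch" line showing 'server name' would confirm what the logs suggest.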

> Hi Everyone,
> 
> I'm using torque-2.1.8 with maui-3.2.6p17.
> The system worked fine before, but a few days ago when I tried to submit
> a job, the job never ran even though I'm sure all the nodes are
> available.  I could still force the job to run using 'qrun'.
> The strange things are:
> when I try to 'showq', it shows
> 0 of    0 Processors Active (0.00%)
> 
> I can use the commands 'pbsnodes -a', 'qstat' and 'qmgr' on the nodes,
> but not on the server. The following is the output on the server:
> pbsnodes -a
> No default server name.
> pbsnodes: cannot connect to server , error=15034
> 
> qstat:
> No default server name.
> qstat: cannot connect to server (null) (errno=15034)
> 
> qmgr
> No default server name.
> qmgr: cannot connect to server
> 
> Also, when I use the 'xpbs' and 'xpbsmon' commands on the nodes, they
> show all the correct information like the server name and queues.
> But when I tried them on the server, they complain about 'No
> Permission.\nxpbs_datadump: Can not connect to server (15007)'
> 
> So I looked at the /var/spool/pbs/server_logs:
> PBS_Server;Svr;WARNING;ALERT: unable to contact node server name
> PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, 
> Could not contact Scheduler - port 15004
> 
> And in the log /var/spool/maui/
> ERROR:    cannot connect to PBS server 'server name'  rc: -1 (errno: 15007)
> ALERT:    cannot re-initialize PBS interface
> 10/23 11:12:02 ALERT:    cannot load cluster resources on RM (RM '0' 
> failed in function 'clusterquery')
> 10/23 11:12:02 WARNING:  no resources detected
> 
> The only thing I can think of is that the server got reset a week ago 
> but I'm sure all the pbs_server, pbs_mom and maui services are back on.  
> 'server_name' is in /var/spool/pbs
> Any ideas?
> Thank you very much.
> 
> Kandy
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
> 




-- 
 James Coyle, PhD
 SGI Origin, Alpha, Xeon and Opteron Cluster Manager
 High Performance Computing Group     
 235 Durham Center            
 Iowa State Univ.           phone: (515)-294-2099
 Ames, Iowa 50011           web: http://jjc.public.iastate.edu



