[torqueusers] Strange problem in Torque and Maui

Vinod KV vinodkv at yahoo-inc.com
Wed Oct 24 01:04:33 MDT 2007


James J Coyle wrote:
> Kandy,
>
>    Have you logged into the server and checked that
> hostname
> and
> cat /var/spool/torque/server_name  
> (or cat /var/spool/pbs/server_name  if you are using pbs rather than torque)
>
> both return the same thing?
>
> From your description, it looks like the logs are telling you that
> the contents of /var/spool/pbs/server_name
> are just the string 'server name',
> not the host name or the IP address.
>
>   I don't know how this got changed; you might check your
> initialization script in e.g. /etc/rc3.d
>
>   
>> Hi Everyone,
>>
>> I'm using torque-2.1.8 with maui-3.2.6p17.
>> The system worked fine before, but a few days ago when I tried to submit a job, 
>> the job never ran even though I'm sure all the nodes are 
>> available.  I could still force the job to run using 'qrun'.
>> The strange things are:
>> when I try to 'showq', it shows
>> 0 of    0 Processors Active (0.00%)
>>
>> I can use the commands 'pbsnodes -a', 'qstat' and 'qmgr' on the nodes,
>> but not on the server. The following is the output on the server:
>> pbsnodes -a
>> No default server name.
>> pbsnodes: cannot connect to server , error=15034
>>
>> qstat:
>> No default server name.
>> qstat: cannot connect to server (null) (errno=15034)
>>
>> qmgr
>> No default server name.
>> qmgr: cannot connect to server
>>
>> Also, when I use the 'xpbs' and 'xpbsmon' commands on the nodes, they 
>> show all the correct information, like the server name and queues.
>> But when I tried them on the server, they complained: 'No 
>> Permission.\nxpbs_datadump: Can not connect to server (15007)'
>>
>> So I look at the /var/spool/pbs/server_logs:
>> PBS_Server;Svr;WARNING;ALERT: unable to contact node server name
>> PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, 
>> Could not contact Scheduler - port 15004
>>
>> And in the log /var/spool/maui/
>> ERROR:    cannot connect to PBS server 'server name'  rc: -1 (errno: 15007)
>> ALERT:    cannot re-initialize PBS interface
>> 10/23 11:12:02 ALERT:    cannot load cluster resources on RM (RM '0' 
>> failed in function 'clusterquery')
>> 10/23 11:12:02 WARNING:  no resources detected
>>
>> The only thing I can think of is that the server got reset a week ago 
>> but I'm sure all the pbs_server, pbs_mom and maui services are back on.  
>> 'server_name' is in /var/spool/pbs
>> Any ideas?
>> Thank you very much.
>>
>> Kandy
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>>     
>
>
>   

This error occurs when neither the server_name file nor the environment 
variable PBS_DEFAULT points to pbs_server. Try setting one of them.
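
As a rough way to check this on the server, the sketch below prints the
server name a client command would end up with. It is only an illustration:
resolve_pbs_server is a hypothetical helper, not part of Torque, and it
assumes that PBS_DEFAULT, when set, wins over the server_name file. Pass it
whichever path your install uses (/var/spool/torque/server_name or
/var/spool/pbs/server_name).

```shell
#!/bin/sh
# Sketch only: resolve_pbs_server is a hypothetical helper, not a Torque tool.
# Assumption: PBS_DEFAULT, if set and non-empty, takes precedence over the
# server_name file.

resolve_pbs_server() {
    # $1: path to the server_name file
    server_name_file=$1
    name=""

    # Prefer the environment variable when it is set.
    if [ -n "${PBS_DEFAULT:-}" ]; then
        printf '%s\n' "$PBS_DEFAULT"
        return 0
    fi

    # Otherwise fall back to the first line of the server_name file.
    # 'read' trims leading and trailing whitespace but keeps inner spaces,
    # so a bogus value such as 'server name' stays visible.
    if [ -r "$server_name_file" ]; then
        read -r name < "$server_name_file" || true
    fi

    if [ -n "$name" ]; then
        printf '%s\n' "$name"
        return 0
    fi

    echo "no server configured: set PBS_DEFAULT or fix $server_name_file" >&2
    return 1
}
```

On the server you could then run
    resolve_pbs_server /var/spool/pbs/server_name
and compare the result with the output of hostname. If they differ, either
put the server's host name into the server_name file or export PBS_DEFAULT
with that host name before running qstat, pbsnodes or qmgr again.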

--vinod

