[torqueusers] PBS Error: Execution server rejected request

notinh notien notinhnotien7 at hotmail.com
Thu Nov 3 15:50:38 MST 2005


Hi, all. I have just added another 4 nodes to our small cluster, but one of 
the nodes refuses to work and I cannot figure out what is wrong with 
it.  Could someone take a look at what I describe here and help me 
figure out the problem?

These 4 new nodes were cloned from exactly the same image, so only the 
hostnames and IPs differ.
The cluster runs torque 1.1.0p4 and maui 3.2.6p9 on RH Linux 9.1 with kernel 
2.4.20-31smp.

The bad node and the head node can ping each other by hostname.  There is 
no firewall on the head node or the compute nodes because they are on a 
private, protected network.  When the bad node's MOM started, the head node 
received a Hello from it.

[root@node14 mom_priv]# momctl -d 3 -h 10.0.1.250
simpleget: Premature end of message
ERROR:    query[0] 'diag' failed on 10.0.1.250 (errno: 0:5)
startcom: diswsi error Protocol failure in commit
[root@node14 mom_priv]# momctl -d 3 -h master
simpleget: Premature end of message
ERROR:    query[0] 'diag' failed on master (errno: 0:5)
startcom: diswsi error Protocol failure in commit

When I submitted a job to this bad node, the job was queued and stayed stuck 
there forever.  qstat -f showed:
comment = Not Running - PBS Error: Execution server rejected request
and substate = 10
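
For anyone reproducing this, a minimal sketch of pulling just those two fields out of qstat -f output. The function name job_summary is made up for illustration, and the sketch assumes the usual "name = value" layout with each value on one line (a wrapped comment would be cut short):

```shell
# Hypothetical helper: read `qstat -f` output on stdin and print the
# job id, substate and comment on one line.
# Assumes each attribute sits on a single "name = value" line.
job_summary() {
    awk '
        /^Job Id:/          { id = $3 }
        /^[ \t]*substate =/ { sub(/^[ \t]*substate = /, ""); substate = $0 }
        /^[ \t]*comment =/  { sub(/^[ \t]*comment = /, "");  comment = $0 }
        END { printf "%s substate=%s comment=%s\n", id, substate, comment }
    '
}
```

It would be used as: qstat -f 8197 | job_summary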

The head node's server log gave:

11/03/2005 14:59:58;0004;PBS_Server;Svr;WARNING;!!! unable to contact node node14 !!!
11/03/2005 15:00:00;0008;PBS_Server;Job;8197.master.stellar.com;unable to run job, MOM rejected
11/03/2005 15:00:00;0008;PBS_Server;Job;8197.master.stellar.com;Job Modified at request of Scheduler@master.stellar.com
11/03/2005 15:00:00;0040;PBS_Server;Svr;master.stellar.com;Scheduler sent command recyc
11/03/2005 15:01:00;0040;PBS_Server;Svr;master.stellar.com;Scheduler sent command time
11/03/2005 15:01:00;0008;PBS_Server;Job;8197.master.stellar.com;Job Modified at request of Scheduler@master.stellar.com
11/03/2005 15:01:00;0008;PBS_Server;Job;8197.master.stellar.com;Job Run at request of Scheduler@master.stellar.com
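
To follow a single job through a log like this, a small filter can help. This is only a sketch: the function name events_for is invented here, and it assumes the six semicolon-separated fields seen in the excerpt (timestamp, event code, server, object type, object name, message), with each record on one unwrapped line:

```shell
# Hypothetical helper: print "timestamp message" for every log record whose
# object name (5th semicolon-separated field) equals the first argument.
# The log is read from stdin, one record per line.
events_for() {
    awk -F';' -v obj="$1" '
        $5 == obj {
            msg = $6
            # Rejoin the message if it contained ";" itself.
            for (i = 7; i <= NF; i++) msg = msg ";" $i
            print $1 " " msg
        }
    '
}
```

For example, events_for 8197.master.stellar.com < serverlog would list only the records for job 8197, where serverlog stands for whichever server log file covers that day.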

The bad node's MOM log file shows nothing.  I tried to configure it with 
$loglevel 7, but when MOM started it complained:
(pbs_mom;Svr;pbs_mom;read_config, special command name loglevel not found 
(ignoring line)).

/etc/hosts and /etc/hosts.equiv list all the nodes in the cluster.  
The nodes file in server_priv lists all the nodes too.
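
That kind of cross-check can be done mechanically. A minimal sketch, assuming the nodes file has the node name as the first word of each line and the hosts file uses the standard "address name [alias...]" layout; the function name missing_from_hosts is invented for this example:

```shell
# Hypothetical helper: print every node name from a nodes file ($1) that has
# no matching name or alias in a hosts file ($2). No output means all match.
# Assumes a non-empty hosts file.
missing_from_hosts() {
    nodes_file="$1"
    hosts_file="$2"
    awk '
        # First file (hosts): remember every name and alias after the address.
        NR == FNR { sub(/#.*/, ""); for (i = 2; i <= NF; i++) known[$i] = 1; next }
        # Second file (nodes): skip blank lines and comments.
        NF == 0 || $1 ~ /^#/ { next }
        !($1 in known) { print $1 }
    ' "$hosts_file" "$nodes_file"
}
```

It would be called with the path to the nodes file first and /etc/hosts second, on the head node and on the bad node alike, since both need to resolve the same names.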

While the other three new nodes are happily running, this bad node just does 
not work at all.

Please suggest how to fix this.
Thank you in advance.
