[torqueusers] PBS Error: Execution server rejected request
notinh notien
notinhnotien7 at hotmail.com
Thu Nov 3 15:50:38 MST 2005
Hi, all. I have just added another 4 nodes to our small cluster, but one of
the nodes refuses to work and I cannot figure out what is wrong with
it. Could someone take a look at what I describe here and help me
figure out the problem?
These 4 new nodes were cloned from the exact same image, so only the
hostnames and IPs differ.
The cluster has torque 1.1.0p4 and maui 3.2.6p9 on RH Linux 9.1 with kernel
2.4.20-31smp.
The bad node and the head node can ping each other by hostname. There is
no firewall on the head node or the compute nodes because they are on a
private, protected network. When the bad node's MOM started, the head node
received a Hello from it.
[root at node14 mom_priv]# momctl -d 3 -h 10.0.1.250
simpleget: Premature end of message
ERROR: query[0] 'diag' failed on 10.0.1.250 (errno: 0:5)
startcom: diswsi error Protocol failure in commit
[root at node14 mom_priv]# momctl -d 3 -h master
simpleget: Premature end of message
ERROR: query[0] 'diag' failed on master (errno: 0:5)
startcom: diswsi error Protocol failure in commit
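For anyone repeating these checks, ping only shows that the hosts can reach each other, not that the daemons' TCP ports accept connections. Below is a minimal sketch of a name-lookup plus port check. The port numbers are an assumption (the usual TORQUE defaults, 15001 for pbs_server and 15002 for pbs_mom) and may differ on a given install; the helper names resolve, port_open and check are made up for this example.

```python
import socket

# Assumed TORQUE default ports; a site may override them.
PBS_SERVER_PORT = 15001
PBS_MOM_PORT = 15002


def resolve(host):
    """Return the IPv4 address for host, or None if the lookup fails."""
    try:
        return socket.gethostbyname(host)
    except OSError:
        return None


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check(host, port):
    """Return a one-line summary of name lookup and port reachability."""
    addr = resolve(host)
    if addr is None:
        return f"{host}: name lookup failed"
    state = "open" if port_open(addr, port) else "closed or unreachable"
    return f"{host} ({addr}) port {port}: {state}"
```

Running print(check("node14", PBS_MOM_PORT)) on the head node and print(check("master", PBS_SERVER_PORT)) on the bad node would show whether each side can actually open a connection to the other's daemon.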
When I submitted a job to this bad node, the job was queued and stayed stuck
there forever. Running qstat -f showed the comment:
comment = Not Running - PBS Error: Execution server rejected request
and substate = 10
The head node's server log gave:
11/03/2005 14:59:58;0004;PBS_Server;Svr;WARNING;!!! unable to contact node node14 !!!
11/03/2005 15:00:00;0008;PBS_Server;Job;8197.master.stellar.com;unable to run job, MOM rejected
11/03/2005 15:00:00;0008;PBS_Server;Job;8197.master.stellar.com;Job Modified at request of Scheduler at master.stellar.com
11/03/2005 15:00:00;0040;PBS_Server;Svr;master.stellar.com;Scheduler sent command recyc
11/03/2005 15:01:00;0040;PBS_Server;Svr;master.stellar.com;Scheduler sent command time
11/03/2005 15:01:00;0008;PBS_Server;Job;8197.master.stellar.com;Job Modified at request of Scheduler at master.stellar.com
11/03/2005 15:01:00;0008;PBS_Server;Job;8197.master.stellar.com;Job Run at request of Scheduler at master.stellar.com
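To pull the entries for one node or job out of a longer server log, the semicolon-separated layout seen above can be split mechanically. This is only a rough sketch: the field names (when, code, daemon, kind, name, message) are my own labels for the six fields, not official PBS terminology, and the two sample lines are copied from the log in this message, each written as a single line.

```python
from collections import namedtuple
from datetime import datetime

# Field names are illustrative labels for the six semicolon-separated fields.
LogRecord = namedtuple("LogRecord", "when code daemon kind name message")


def parse_line(line):
    """Split one log line into a LogRecord, or return None if it does not fit."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    try:
        when = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return LogRecord(when, *parts[1:])


def records_for(lines, needle):
    """Yield parsed records whose object name or message mentions needle."""
    for line in lines:
        rec = parse_line(line)
        if rec and (needle in rec.name or needle in rec.message):
            yield rec


sample = [
    "11/03/2005 14:59:58;0004;PBS_Server;Svr;WARNING;!!! unable to contact node node14 !!!",
    "11/03/2005 15:00:00;0008;PBS_Server;Job;8197.master.stellar.com;unable to run job, MOM rejected",
]

for rec in records_for(sample, "node14"):
    print(rec.when, rec.code, rec.message)
```

Passing an open log file instead of the sample list works the same way, since records_for only iterates over lines.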
The bad node's MOM log file shows nothing. I tried to configure it with
$loglevel 7, but when MOM started it complained
(pbs_mom;Svr;pbs_mom;read_config, special command name loglevel not found
(ignoring line)).
/etc/hosts and /etc/hosts.equiv list all the nodes in the cluster.
The nodes file in server_priv lists all the nodes too.
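Since the new nodes were cloned and only hostnames and IPs were edited by hand, the claim above can be double-checked mechanically. Here is a small sketch that compares the names in the nodes file against the names in /etc/hosts. It assumes the nodes file has one node per line with the node name as the first token, and that "#" starts a comment in both files; the function names and the sample addresses in the usage note are hypothetical.

```python
def hosts_names(text):
    """Return every hostname and alias found in /etc/hosts-style text."""
    names = set()
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        names.update(fields[1:])  # fields[0] is the IP address
    return names


def node_names(text):
    """Return node names from nodes-file text (first token of each line)."""
    names = []
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if fields:
            names.append(fields[0])
    return names


def missing_from_hosts(hosts_text, nodes_text):
    """Return nodes that appear in the nodes file but not in the hosts text."""
    known = hosts_names(hosts_text)
    return [name for name in node_names(nodes_text) if name not in known]
```

For example, with made-up contents, missing_from_hosts("10.0.1.250 master\n10.0.1.13 node13\n", "node13 np=2\nnode14 np=2\n") reports node14 as missing. Running the same comparison on every node, not just the head node, would catch a hosts file that was not updated on one clone.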
While the other three new nodes are happily running, this bad node just does
not work at all.
Please suggest how to fix this.
Thank you in advance.