[torqueusers] Jobs don't run unless forced w/ qrun and then only to
first node... 15004 error.
Michael T. Colee
mtc at icess.ucsb.edu
Thu Oct 28 15:19:28 MDT 2004
Hello,
I've tried to do my homework but cannot figure this out.
I've installed torque-1.1.0p3 on a small FC2 cluster (31 nodes and 1
head node). My needs are simple, so I plan to skip Maui and just use
the default FIFO scheduler (pbs_sched). The master and compute nodes are
on a dedicated switch and the firewall is off on that NIC. rsh/rlogin
works in both directions between the nodes and the master node. A simple
batch queue serviced on a first-come-first-served basis is all I need.
I've had trouble finding documentation on running pbs_sched; is there a
cheat sheet for it similar to the torque quick start guide?
My problem is that jobs submitted via qsub remain in the queue and are
not run unless forced with a "qrun jobnumber" as root on the head node.
Further, if multiple jobs are in the queue and all are forced with qrun,
they are all sent to the first node.
Here's what I know:
I followed the basic quick start instructions. I think the problem is
related to pbs_sched port access, based on the following line in
/usr/spool/PBS/server_logs/:
10/25/2004 15:06:18;0001;PBS_Server;Svr;PBS_Server;Connection refused
(111) in contact_sched, Could not contact Scheduler - port 15004
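One way to test that theory is to check whether anything is actually
listening on the scheduler port. The following is only a sketch: the
helper name port_listening is made up for illustration, and it assumes
the column layout of Linux "netstat -tln" output (local address in
column 4, state in column 6). A "Connection refused" would be consistent
with no listener at all on port 15004, but that is a guess, not a
confirmed diagnosis.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if the "netstat -tln" style
# listing on stdin shows a TCP socket in LISTEN state on the given port.
port_listening() {
    awk -v port="$1" '
        $6 == "LISTEN" {
            # Local address looks like 0.0.0.0:15004 or :::15004;
            # the port is whatever follows the last colon.
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Example use on the head node (not run here):
#   netstat -tln | port_listening 15004 || echo "nothing listening on 15004"
#   ps -e | grep pbs_sched
```

If the check fails while pbs_sched is supposed to be running, the ps
line shows whether the scheduler process is still alive.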
I have pbs_mom installed and running on each of the nodes but not on the
head node and my /usr/spool/PBS/server_priv/nodes file lists each node.
My head node and default queue are both named "fc" (fuster-cluck).
I currently start pbs_sched and pbs_server by calling them from the
command line, but will add them to /etc/rc.local once things are working.
On the nodes I have /usr/spool/PBS/mom_priv/config:
[root at node10 mom_priv]# cat config
$clienthost 10.10.10.1
$logevent 255
$restricted 10.10.10.1
$usecp fc:/home /home
[root at fc ]# pbsnodes -a
node01
state = free
np = 1
ntype = cluster
status = arch=linux,uname=Linux node01 2.6.5-1.358 #1 Sat May 8
09:04:50 EDT 2004 i686,sessions=? 0,nsessions=?
0,nusers=0,idletime=1657962,totmem=1759428kb,availmem=1427120kb,physmem=972252kb,ncpus=1,loadave=0.00,netload=4297855,rectime=1098744239
.
. snip 02-30
.
node31
state = free
np = 1
ntype = cluster
status = arch=linux,uname=Linux node31 2.6.5-1.358 #1 Sat May 8
09:04:50 EDT 2004 i686,sessions=? 0,nsessions=?
0,nusers=0,idletime=1134489,totmem=1759424kb,availmem=1566460kb,physmem=972248kb,ncpus=1,loadave=0.00,netload=228540120,rectime=1098744249
[root at fc ]# qmgr -c "print server"
#
# Create queues and set their attributes.
#
#
# Create and define queue fc
#
create queue fc
set queue fc queue_type = Execution
set queue fc resources_default.nodes = 1
set queue fc resources_default.walltime = 01:00:00
set queue fc enabled = True
set queue fc started = True
#
# Set server attributes.
#
set server scheduling = True
set server default_queue = fc
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 6
[root at fc ]#
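With 31 nodes, the pbsnodes listing is long, so a per-state count is
easier to scan. This is a minimal sketch; the helper name node_states is
made up for illustration, and it assumes the "state = ..." line format
shown in the pbsnodes -a output above.

```shell
#!/bin/sh
# Hypothetical helper: read "pbsnodes -a" output on stdin and print one
# line per node state with the number of nodes reporting that state.
node_states() {
    awk '
        $1 == "state" && $2 == "=" { count[$3]++ }
        END { for (s in count) print s, count[s] }
    ' | sort
}

# Example use on the head node (not run here):
#   pbsnodes -a | node_states
```

If every node reports "free", as in the listing above, the node list
itself is probably not what keeps jobs queued.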
As a standard user I submit a test job, which gets stuck in the queue:
100mtc at fc[mtc] qsub /home/mtc/werk/fc/test
101mtc at fc[mtc] qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
7.fc test mtc 0 Q fc
102mtc at fc[mtc] qstat -f
Job Id: 7.fc
Job_Name = test
Job_Owner = mtc at fc
job_state = Q
queue = fc
server = fc
Checkpoint = u
ctime = Mon Oct 25 15:26:33 2004
Error_Path = fc:/home/mtc/werk/fc/test.e7
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Mon Oct 25 15:26:33 2004
Output_Path = fc:/home/mtc/werk/fc/test.o7
Priority = 0
qtime = Mon Oct 25 15:26:33 2004
Rerunable = True
Resource_List.nodes = 1
Resource_List.walltime = 01:00:00
Variable_List = PBS_O_HOME=/home/mtc,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=mtc,
PBS_O_PATH=/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/b
in:/home/mtc/bin:/home/mtc/bin/sh:/home/mtc/bin/perl:/home/mtc/bin/mrsn
therm:/usr/sbin:/home/mtc/bin/i686:/home/mtc/bin,
PBS_O_MAIL=/var/spool/mail/mtc,PBS_O_SHELL=bash,PBS_O_HOST=fc,
PBS_O_WORKDIR=/home/mtc/werk/fc,PBS_O_QUEUE=fc
etime = Mon Oct 25 15:26:33 2004
-------------------Most recent excerpt from /usr/spool/PBS/server_logs:
10/25/2004 15:06:18;0001;PBS_Server;Svr;PBS_Server;Connection refused
(111) in contact_sched, Could not contact Scheduler - port 15004
10/25/2004 15:06:49;0100;PBS_Server;Req;;Type authenticateuser request
received from root at fc, sock=10
10/25/2004 15:06:49;0100;PBS_Server;Req;;Type statusjob request received
from root at fc, sock=9
10/25/2004 15:16:18;0001;PBS_Server;Svr;PBS_Server;Connection refused
(111) in contact_sched, Could not contact Scheduler - port 15004
10/25/2004 15:26:18;0001;PBS_Server;Svr;PBS_Server;Connection refused
(111) in contact_sched, Could not contact Scheduler - port 15004
10/25/2004 15:26:33;0100;PBS_Server;Req;;Type authenticateuser request
received from mtc at fc, sock=10
10/25/2004 15:26:33;0100;PBS_Server;Req;;Type queuejob request received
from mtc at fc, sock=9
10/25/2004 15:26:33;0100;PBS_Server;Req;;Type jobscript request received
from mtc at fc, sock=9
10/25/2004 15:26:33;0100;PBS_Server;Req;;Type readytocommit request
received from mtc at fc, sock=9
10/25/2004 15:26:33;0100;PBS_Server;Req;;Type commit request received
from mtc at fc, sock=9
10/25/2004 15:26:33;0100;PBS_Server;Job;7.fc;enqueuing into fc, state 1
hop 1
10/25/2004 15:26:33;0008;PBS_Server;Job;7.fc;Job Queued at request of
mtc at fc, owner = mtc at fc, job name = test, queue = fc
10/25/2004 15:26:33;0001;PBS_Server;Svr;PBS_Server;Connection refused
(111) in contact_sched, Could not contact Scheduler - port 15004
10/25/2004 15:26:35;0100;PBS_Server;Req;;Type authenticateuser request
received from mtc at fc, sock=10
10/25/2004 15:26:35;0100;PBS_Server;Req;;Type statusjob request received
from mtc at fc, sock=9
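To see how often the server fails to reach the scheduler, the log can be
filtered for just those entries. This is a sketch under assumptions: the
helper name sched_failures is made up, and it assumes each entry sits on
a single line in the real log file (the excerpt above is wrapped by the
mail client) with fields separated by ";".

```shell
#!/bin/sh
# Hypothetical helper: read a pbs_server log on stdin and print the
# timestamp (first ";"-separated field) of every entry reporting that
# the scheduler could not be contacted.
sched_failures() {
    awk -F';' '$0 ~ /Could not contact Scheduler/ { print $1 }'
}

# Example use (the log file name is an assumption, not shown not run):
#   sched_failures < /usr/spool/PBS/server_logs/20041025
```

In the excerpt above the failures at 15:06:18, 15:16:18 and 15:26:18 are
600 seconds apart, which appears to match scheduler_iteration = 600, and
one more follows immediately after the job is queued at 15:26:33.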
I tried shutting down the server and scheduler and reinstalling pbs_mom
on all the nodes, but I still get the same 15004 error:
10/28/2004 13:54:01;0086;PBS_Server;Svr;PBS_Server;Shutdown request from
root at fc
10/28/2004 13:54:01;0086;PBS_Server;Svr;PBS_Server;Starting to shutdown
the server, type is Quick
10/28/2004 13:54:15;0002;PBS_Server;Svr;PBS_Server;Server shutdown completed
10/28/2004 13:54:15;0002;PBS_Server;Svr;Log;Log closed
10/28/2004 14:02:58;0002;PBS_Server;Svr;Log;Log opened
10/28/2004 14:02:58;0006;PBS_Server;Svr;PBS_Server;Server fc started,
initialization type = 1
10/28/2004 14:02:58;0002;PBS_Server;Svr;Act;Account file
/usr/spool/PBS/server_priv/accounting/20041028 opened
10/28/2004 14:02:58;0040;PBS_Server;Req;setup_nodes;setup_nodes()
10/28/2004 14:02:58;0086;PBS_Server;Svr;PBS_Server;Recovered queue fc
10/28/2004 14:02:58;0002;PBS_Server;Svr;PBS_Server;Expected 1, recovered
1 queues
10/28/2004 14:02:58;0100;PBS_Server;Job;7.fc;enqueuing into fc, state 1
hop 1
10/28/2004 14:02:58;0086;PBS_Server;Job;7.fc;Requeueing job, substate:
10 Requeued in queue: fc
10/28/2004 14:02:58;0002;PBS_Server;Svr;PBS_Server;Expected 1, recovered
1 jobs
10/28/2004 14:02:58;0006;PBS_Server;Svr;PBS_Server;Using ports
Server:15001 Scheduler:15004 MOM:15002
10/28/2004 14:02:58;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 8526
10/28/2004 14:02:58;0001;PBS_Server;Svr;PBS_Server;Connection refused
(111) in contact_sched, Could not contact Scheduler - port 15004
10/28/2004 14:02:59;0100;PBS_Server;Req;;Type authenticateuser request
received from root at fc, sock=10
10/28/2004 14:02:59;0100;PBS_Server;Req;;Type statusjob request received
from root at fc, sock=9
As root I can force the job to run, and it runs correctly (output and
error logs are as expected), except that only the first node is ever
used.
Thanks in advance,
mtc