[torqueusers] Jobs don't run unless forced w/ qrun and then only to first node... 15004 error.

Michael T. Colee mtc at icess.ucsb.edu
Thu Oct 28 15:19:28 MDT 2004


Hello,

I've tried to do my homework but cannot figure this out.

I've installed torque-1.1.0p3 on a small FC2 cluster (31 compute nodes
and 1 head node).  My needs are simple, so I plan to skip Maui and use
the default FIFO scheduler (pbs_sched).  The head node and compute nodes
are on a dedicated switch and the firewall is off on that NIC.
rsh/rlogin works in both directions between the nodes and the head node.

A simple batch queue serviced on a first-come-first-served basis is all
I need.  I've had trouble finding documentation on running pbs_sched.
Is there a cheat sheet for it similar to the torque quick start guide?

My problem is that jobs submitted via qsub remain in the queue and are
not run unless forced with a "qrun jobnumber" as root on the head node.
Further, if multiple jobs are in the queue and all are forced with qrun,
they are all sent to the first node.

Here's what I know:

I followed the basic quick start instructions.  I think the problem is
related to pbs_sched port access, based on the following line in
/usr/spool/PBS/server_logs/:

10/25/2004 15:06:18;0001;PBS_Server;Svr;PBS_Server;Connection refused
(111) in contact_sched, Could not contact Scheduler - port 15004
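
One way to confirm whether anything is listening on the scheduler port is a plain TCP connect test run on the head node. The following is only a minimal sketch and not part of TORQUE: the helper name `port_is_open` and the choice of `localhost` are my own assumptions, and 15004 is simply the port named in the log line above.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises OSError (for example ECONNREFUSED,
        # errno 111) when nothing accepts connections on that port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 15004 is the scheduler port reported in the server log.
    state = "open" if port_is_open("localhost", 15004) else "closed or refused"
    print("scheduler port 15004 is", state)
```

If this reports the port as closed while pbs_sched is supposedly running, that would suggest the scheduler process is not actually listening, rather than a firewall or routing problem.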

I have pbs_mom installed and running on each of the nodes but not on the
head node, and my /usr/spool/PBS/server_priv/nodes file lists each node.
My head node and default queue are both "fc" (fuster-cluck).

For now I start pbs_sched and pbs_server by hand from the command line;
I will add them to /etc/rc.local once I get things working.

On the nodes I have /usr/spool/PBS/mom_priv/config:

[root at node10 mom_priv]# cat config
$clienthost     10.10.10.1
$logevent       255
$restricted     10.10.10.1
$usecp fc:/home /home


[root at fc ]# pbsnodes -a
node01
     state = free
     np = 1
     ntype = cluster
     status = arch=linux,uname=Linux node01 2.6.5-1.358 #1 Sat May 8
09:04:50 EDT 2004 i686,sessions=? 0,nsessions=?
0,nusers=0,idletime=1657962,totmem=1759428kb,availmem=1427120kb,physmem=972252kb,ncpus=1,loadave=0.00,netload=4297855,rectime=1098744239
.
.  snip 02-30
.
node31
     state = free
     np = 1
     ntype = cluster
     status = arch=linux,uname=Linux node31 2.6.5-1.358 #1 Sat May 8
09:04:50 EDT 2004 i686,sessions=? 0,nsessions=?
0,nusers=0,idletime=1134489,totmem=1759424kb,availmem=1566460kb,physmem=972248kb,ncpus=1,loadave=0.00,netload=228540120,rectime=1098744249
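
To check the state of all nodes at a glance, the listing above can be reduced to a per-node summary. This is a rough sketch under stated assumptions: the function name `parse_pbsnodes` is hypothetical, it assumes each attribute sits on a single indented `key = value` line (the long `status` line is wrapped by my mail client and is left out of the sample), and the sample text covers only the two nodes shown in full above.

```python
def parse_pbsnodes(text):
    """Parse `pbsnodes -a` style text into {node_name: {attribute: value}}."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new node record.
            current = line.strip()
            nodes[current] = {}
        elif current is not None and " = " in line:
            # Indented "key = value" lines belong to the current node.
            key, value = line.strip().split(" = ", 1)
            nodes[current][key] = value
    return nodes


# Sample trimmed from the listing above (status lines omitted).
sample = """\
node01
     state = free
     np = 1
     ntype = cluster
node31
     state = free
     np = 1
     ntype = cluster
"""

nodes = parse_pbsnodes(sample)
free = sorted(name for name, attrs in nodes.items() if attrs.get("state") == "free")
print(len(free), "free nodes:", ", ".join(free))
```

Every node reports `state = free`, so the server does appear to see all of them as available.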


[root at fc ]# qmgr -c "print server"
#
# Create queues and set their attributes.
#
#
# Create and define queue fc
#
create queue fc
set queue fc queue_type = Execution
set queue fc resources_default.nodes = 1
set queue fc resources_default.walltime = 01:00:00
set queue fc enabled = True
set queue fc started = True
#
# Set server attributes.
#
set server scheduling = True
set server default_queue = fc
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 6
[root at fc ]#

As a standard user I submit a test job, which gets stuck in the queue:

100mtc at fc[mtc] qsub /home/mtc/werk/fc/test
101mtc at fc[mtc] qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
7.fc             test             mtc                     0 Q fc

102mtc at fc[mtc] qstat -f
Job Id: 7.fc
    Job_Name = test
    Job_Owner = mtc at fc
    job_state = Q
    queue = fc
    server = fc
    Checkpoint = u
    ctime = Mon Oct 25 15:26:33 2004
    Error_Path = fc:/home/mtc/werk/fc/test.e7
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Mon Oct 25 15:26:33 2004
    Output_Path = fc:/home/mtc/werk/fc/test.o7
    Priority = 0
    qtime = Mon Oct 25 15:26:33 2004
    Rerunable = True
    Resource_List.nodes = 1
    Resource_List.walltime = 01:00:00
    Variable_List = PBS_O_HOME=/home/mtc,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=mtc,
        PBS_O_PATH=/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/b
        in:/home/mtc/bin:/home/mtc/bin/sh:/home/mtc/bin/perl:/home/mtc/bin/mrsn
        therm:/usr/sbin:/home/mtc/bin/i686:/home/mtc/bin,
        PBS_O_MAIL=/var/spool/mail/mtc,PBS_O_SHELL=bash,PBS_O_HOST=fc,
        PBS_O_WORKDIR=/home/mtc/werk/fc,PBS_O_QUEUE=fc
    etime = Mon Oct 25 15:26:33 2004



-------------------Most recent excerpt from /usr/spool/PBS/server_logs:

10/25/2004 15:06:18;0001;PBS_Server;Svr;PBS_Server;Connection refused
(111) in contact_sched, Could not contact Scheduler - port 15004
10/25/2004 15:06:49;0100;PBS_Server;Req;;Type authenticateuser request
received from root at fc, sock=10
10/25/2004 15:06:49;0100;PBS_Server;Req;;Type statusjob request received
from root at fc, sock=9
10/25/2004 15:16:18;0001;PBS_Server;Svr;PBS_Server;Connection refused
(111) in contact_sched, Could not contact Scheduler - port 15004
10/25/2004 15:26:18;0001;PBS_Server;Svr;PBS_Server;Connection refused
(111) in contact_sched, Could not contact Scheduler - port 15004
10/25/2004 15:26:33;0100;PBS_Server;Req;;Type authenticateuser request
received from mtc at fc, sock=10
10/25/2004 15:26:33;0100;PBS_Server;Req;;Type queuejob request received
from mtc at fc, sock=9
10/25/2004 15:26:33;0100;PBS_Server;Req;;Type jobscript request received
from mtc at fc, sock=9
10/25/2004 15:26:33;0100;PBS_Server;Req;;Type readytocommit request
received from mtc at fc, sock=9
10/25/2004 15:26:33;0100;PBS_Server;Req;;Type commit request received
from mtc at fc, sock=9
10/25/2004 15:26:33;0100;PBS_Server;Job;7.fc;enqueuing into fc, state 1
hop 1
10/25/2004 15:26:33;0008;PBS_Server;Job;7.fc;Job Queued at request of
mtc at fc, owner = mtc at fc, job name = test, queue = fc
10/25/2004 15:26:33;0001;PBS_Server;Svr;PBS_Server;Connection refused
(111) in contact_sched, Could not contact Scheduler - port 15004
10/25/2004 15:26:35;0100;PBS_Server;Req;;Type authenticateuser request
received from mtc at fc, sock=10
10/25/2004 15:26:35;0100;PBS_Server;Req;;Type statusjob request received
from mtc at fc, sock=9
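
To see how often the server fails to reach the scheduler, the records can be split on semicolons and the failure timestamps compared. This is a small sketch, not a TORQUE tool: the function name `scheduler_failures` is hypothetical, and the sample is a subset of the records above with the mail line-wrapping undone by hand.

```python
from datetime import datetime

# Records copied from the excerpt above, with the mail wrapping undone.
LOG = """\
10/25/2004 15:06:18;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler - port 15004
10/25/2004 15:06:49;0100;PBS_Server;Req;;Type statusjob request received from root at fc, sock=9
10/25/2004 15:16:18;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler - port 15004
10/25/2004 15:26:18;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler - port 15004
10/25/2004 15:26:33;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler - port 15004
"""


def scheduler_failures(log_text):
    """Return timestamps of records whose message mentions contact_sched."""
    stamps = []
    for record in log_text.splitlines():
        # Fields are semicolon-separated; the message is the sixth field.
        fields = record.split(";", 5)
        if len(fields) == 6 and "contact_sched" in fields[5]:
            stamps.append(datetime.strptime(fields[0], "%m/%d/%Y %H:%M:%S"))
    return stamps


failures = scheduler_failures(LOG)
# Seconds between consecutive failed attempts to contact the scheduler.
gaps = [int((b - a).total_seconds()) for a, b in zip(failures, failures[1:])]
print("failures:", len(failures), "gaps in seconds:", gaps)
```

The first two gaps are 600 seconds, which matches the `scheduler_iteration = 600` setting shown by qmgr, and the last failure lands at the moment job 7.fc was queued. That would be consistent with the server trying the scheduler once per iteration and again on each new job, and being refused every time.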



I shut down the server and scheduler and reinstalled pbs_mom on all the
nodes, but I still get the same 15004 error after restarting:

10/28/2004 13:54:01;0086;PBS_Server;Svr;PBS_Server;Shutdown request from 
root at fc
10/28/2004 13:54:01;0086;PBS_Server;Svr;PBS_Server;Starting to shutdown 
the server, type is Quick
10/28/2004 13:54:15;0002;PBS_Server;Svr;PBS_Server;Server shutdown completed
10/28/2004 13:54:15;0002;PBS_Server;Svr;Log;Log closed
10/28/2004 14:02:58;0002;PBS_Server;Svr;Log;Log opened
10/28/2004 14:02:58;0006;PBS_Server;Svr;PBS_Server;Server fc started, 
initialization type = 1
10/28/2004 14:02:58;0002;PBS_Server;Svr;Act;Account file 
/usr/spool/PBS/server_priv/accounting/20041028 opened
10/28/2004 14:02:58;0040;PBS_Server;Req;setup_nodes;setup_nodes()

10/28/2004 14:02:58;0086;PBS_Server;Svr;PBS_Server;Recovered queue fc
10/28/2004 14:02:58;0002;PBS_Server;Svr;PBS_Server;Expected 1, recovered 
1 queues
10/28/2004 14:02:58;0100;PBS_Server;Job;7.fc;enqueuing into fc, state 1 
hop 1
10/28/2004 14:02:58;0086;PBS_Server;Job;7.fc;Requeueing job, substate: 
10 Requeued in queue: fc
10/28/2004 14:02:58;0002;PBS_Server;Svr;PBS_Server;Expected 1, recovered 
1 jobs
10/28/2004 14:02:58;0006;PBS_Server;Svr;PBS_Server;Using ports 
Server:15001  Scheduler:15004  MOM:15002
10/28/2004 14:02:58;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 8526
10/28/2004 14:02:58;0001;PBS_Server;Svr;PBS_Server;Connection refused 
(111) in contact_sched, Could not contact Scheduler - port 15004
10/28/2004 14:02:59;0100;PBS_Server;Req;;Type authenticateuser request 
received from root at fc, sock=10
10/28/2004 14:02:59;0100;PBS_Server;Req;;Type statusjob request received 
from root at fc, sock=9


As root I can force the job to run with qrun, and it runs correctly
(output and error logs are as expected), except that only the first
node is ever used.

Thanks in advance,
mtc


