[torqueusers] torque requires --enable-debug to work
Andrew Sharpe
andrew.sharpe1 at jcu.edu.au
Tue Oct 11 21:50:35 MDT 2005
Hi all,
I'm fairly new to torque but I think this is non-standard behaviour.
I've verified that this problem exists on
torque-1.2.0p4
torque-1.2.0p6
torque-1.2.0p7-snap.1127772314
using CentOS4.1 (full install) on x86_64.
The problem is that torque only works if I compile it with
--enable-debug. Here's the steps I follow to obtain the results:
NOTE: pbs is a CNAME to machine1, which correctly resolves to 10.1.1.12,
machine2 correctly resolves to 10.1.1.13 - by resolves I mean forward
and reverse lookups are ok.
1. compile any of the above versions using the following commands
./configure --prefix=/usr/local --set-default-server=pbs
--set-server-home=/var/spool/PBS --enable-server --enable-docs
--enable-mom --enable-clients --enable-syslog --disable-rpp
make
make install
2. configure the server
a. add the client nodes to /var/spool/PBS/server_priv/nodes
b. set the default server in /var/spool/PBS/server_name
c. configure a queue (taken from
http://www.clusterresources.com/products/torque/docs/manualconfig.shtml)
pbs_server -t create
qmgr -c "set server scheduling=true"
qmgr -c "create queue batch queue_type=execution"
qmgr -c "set queue batch started=true"
qmgr -c "set queue batch enabled=true"
qmgr -c "set queue batch resources_default.nodes=1"
qmgr -c "set queue batch resources_default.walltime=3600"
qmgr -c "set server default_queue=batch"
d. stop the server
qterm -t quick
3. configure the client(s)
a. set the default server (for client commands) in
/var/spool/PBS/server_name
b. configure mom (in /var/spool/PBS/mom_priv/config
$logevent 255
$loglevel 4
$pbsserver pbs
$clienthost 10.1.1.13
$cputmult 1.0
$wallmult 1.0
$max_load 2.5
$ideal_load 2.0
4. start the server - note that I don't use a scheduler here, the
problem occurs whether using pbs_sched, maui or qrun
/usr/local/sbin/pbs_server
/usr/local/sbin/pbs_mom
5. start the client
/usr/local/sbin/pbs_mom
Now for some diags:
Both machines are using ntp, and here's the output of ntpq -p for each
machine
computer1
remote refid st t when poll reach delay offset
jitter
==============================================================================
*ntp LOCAL(0) 11 u 903 1024 377 0.105 -0.098
0.083
computer2
remote refid st t when poll reach delay offset
jitter
==============================================================================
*ntp LOCAL(0) 11 u 499 1024 377 0.116 0.283
0.008
On the server, run pbsnodes -a
computer1
state = free
np = 1
ntype = cluster
status = arch=linux,uname=Linux computer1 2.6.9-11.ELsmp #1 SMP Wed
Jun 8 16:59:12 CDT 2005 x86_64,sessions=1329 2261
26560,nsessions=3,nusers=1,idletime=2032,totme
m=5111944kb,availmem=4927996kb,physmem=4059696kb,ncpus=2,loadave=0.00,netload=17951093,state=free,rectime=1129082554
computer2
state = free
np = 1
ntype = cluster
status = arch=linux,uname=Linux computer2 2.6.9-11.ELsmp #1 SMP Wed
Jun 8 16:59:12 CDT 2005 x86_64,sessions=1269
1995,nsessions=2,nusers=1,idletime=2033,totmem=5111
944kb,availmem=4931696kb,physmem=4059696kb,ncpus=2,loadave=0.00,netload=17553961,state=free,rectime=1129082545
Now as a normal user on computer1, submit a job
echo "sleep 20" | qsub -r n
Check the queue
qstat
computer 1
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
0.computer1 STDIN hpc 0 Q
batch
qstat -an
computer 1
Req'd
Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time
S Time
--------------- -------- -------- ---------- ------ --- --- ------ -----
- -----
0.computer1 hpc batch STDIN -- -- -- -- 01:00
Q --
--
As we aren't using a scheduler at the moment (to limit the search
space), run the job as root on computer1
qrun -H computer2 0
qrun: Execution server rejected request MSG=send failed, STARTING
0.computer1
Here's another look at the queue after the above sequence of events
qstat -an
computer1
Req'd
Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time
S Time
--------------- -------- -------- ---------- ------ --- --- ------ -----
- -----
0.computer1 hpc batch STDIN -- -- -- -- 01:00
Q --
computer2/0
So the job appears to not be running, however the wierdest bit is that
the job actually does run about a minute later, so here's the logs of
pbs_server on computer1 and mom on computer2
pbs_server on computer1
==================
10/12/2005 13:29:50;0002;PBS_Server;Svr;Log;Log opened
10/12/2005 13:29:50;0006;PBS_Server;Svr;PBS_Server;Server r2cn09
started, initialization type = 1
10/12/2005 13:29:50;0002;PBS_Server;Svr;Act;Account file
/var/spool/PBS/server_priv/accounting/20051012 opened
10/12/2005 13:29:50;0040;PBS_Server;Req;setup_nodes;setup_nodes()
10/12/2005 13:29:50;0086;PBS_Server;Svr;PBS_Server;Recovered queue batch
10/12/2005 13:29:50;0002;PBS_Server;Svr;PBS_Server;Expected 1, recovered
1 queues
10/12/2005 13:29:50;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered
0 jobs
10/12/2005 13:29:50;0006;PBS_Server;Svr;PBS_Server;Using ports
Server:15001 Scheduler:15004 MOM:15002
10/12/2005 13:29:50;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid =
26626, loglevel=0
10/12/2005 13:29:50;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact
node r2cn09
10/12/2005 13:29:50;0001;PBS_Server;Svr;PBS_Server;Connection refused
(111) in contact_sched, Could not contact Scheduler - port 15004
10/12/2005 13:29:54;0040;PBS_Server;Req;is_stat_get;node r2cn10 marked
available
10/12/2005 13:29:56;0100;PBS_Server;Req;;Type AuthenticateUser request
received from root at r2cn09, sock=11
10/12/2005 13:29:56;0100;PBS_Server;Req;;Type StatusNode request
received from root at r2cn09, sock=10
10/12/2005 13:30:00;0040;PBS_Server;Req;is_stat_get;node r2cn09 marked
available
10/12/2005 13:30:02;0100;PBS_Server;Req;;Type AuthenticateUser request
received from root at r2cn09, sock=11
10/12/2005 13:30:02;0100;PBS_Server;Req;;Type StatusNode request
received from root at r2cn09, sock=10
10/12/2005 13:30:05;0100;PBS_Server;Req;;Type AuthenticateUser request
received from hpc at r2cn09, sock=11
10/12/2005 13:30:05;0100;PBS_Server;Req;;Type QueueJob request received
from hpc at r2cn09, sock=10
10/12/2005 13:30:05;0100;PBS_Server;Req;;Type JobScript request received
from hpc at r2cn09, sock=10
10/12/2005 13:30:05;0100;PBS_Server;Req;;Type ReadyToCommit request
received from hpc at r2cn09, sock=10
10/12/2005 13:30:05;0100;PBS_Server;Req;;Type Commit request received
from hpc at r2cn09, sock=10
10/12/2005 13:30:05;0100;PBS_Server;Job;0.r2cn09;enqueuing into batch,
state 1 hop 1
10/12/2005 13:30:05;0008;PBS_Server;Job;0.r2cn09;Job Queued at request
of hpc at r2cn09, owner = hpc at r2cn09, job name = STDIN, queue = batch
10/12/2005 13:30:05;0001;PBS_Server;Svr;PBS_Server;Connection refused
(111) in contact_sched, Could not contact Scheduler - port 15004
10/12/2005 13:30:07;0100;PBS_Server;Req;;Type AuthenticateUser request
received from hpc at r2cn09, sock=11
10/12/2005 13:30:07;0100;PBS_Server;Req;;Type StatusJob request received
from hpc at r2cn09, sock=10
10/12/2005 13:30:09;0100;PBS_Server;Req;;Type AuthenticateUser request
received from hpc at r2cn09, sock=11
10/12/2005 13:30:09;0100;PBS_Server;Req;;Type StatusServer request
received from hpc at r2cn09, sock=10
10/12/2005 13:30:09;0100;PBS_Server;Req;;Type StatusJob request received
from hpc at r2cn09, sock=10
10/12/2005 13:30:36;0100;PBS_Server;Req;;Type AuthenticateUser request
received from root at r2cn09, sock=11
10/12/2005 13:30:36;0100;PBS_Server;Req;;Type RunJob request received
from root at r2cn09, sock=10
10/12/2005 13:30:36;0008;PBS_Server;Job;0.r2cn09;Job Run at request of
root at r2cn09
10/12/2005 13:30:42;0001;PBS_Server;Svr;PBS_Server;Operation now in
progress (115) in send_job, send_job commit failed, rc=15031 (Premature
end of message)
10/12/2005 13:30:42;0008;PBS_Server;Job;0.r2cn09;unable to run job, MOM
rejected/rc=1
10/12/2005 13:30:42;0080;PBS_Server;Req;req_reject;Reject reply
code=15041(Execution server rejected request MSG=send failed, STARTING),
aux=0, type=RunJob, from root at r2
cn09
10/12/2005 13:30:42;0001;PBS_Server;Svr;PBS_Server;Connection refused
(111) in contact_sched, Could not contact Scheduler - port 15004
10/12/2005 13:31:35;0100;PBS_Server;Req;;Type AuthenticateUser request
received from hpc at r2cn09, sock=11
10/12/2005 13:31:35;0100;PBS_Server;Req;;Type StatusServer request
received from hpc at r2cn09, sock=10
10/12/2005 13:31:35;0100;PBS_Server;Req;;Type StatusJob request received
from hpc at r2cn09, sock=10
10/12/2005 13:32:08;0100;PBS_Server;Req;;Type JobObituary request
received from pbs_mom at r2cn10.hpc.net, sock=10
10/12/2005 13:32:08;0009;PBS_Server;Job;0.r2cn09;obit received for job
0.r2cn09 from host r2cn10.hpc.net with bad state (state: QUEUED)
10/12/2005 13:32:08;0080;PBS_Server;Req;req_reject;Reject reply
code=15016(Request invalid for state of job), aux=0, type=JobObituary,
from pbs_mom at r2cn10.hpc.net
10/12/2005 13:32:31;0086;PBS_Server;Svr;PBS_Server;Starting to shutdown
the server, type is By Signal
10/12/2005 13:32:31;0002;PBS_Server;Svr;PBS_Server;Server shutdown completed
10/12/2005 13:32:31;0002;PBS_Server;Svr;Log;Log closed
mom on computer2
==============
10/12/2005 13:29:44;0002; pbs_mom;Svr;Log;Log opened
10/12/2005 13:29:44;0002; pbs_mom;Svr;setloglevel;4
10/12/2005 13:29:44;0002; pbs_mom;Svr;restricted;*.hpc.net
10/12/2005 13:29:44;0002; pbs_mom;Svr;cputmult;1.0
10/12/2005 13:29:44;0002; pbs_mom;Svr;wallmult;1.0
10/12/2005 13:29:44;0002; pbs_mom;Svr;max_load;2.5
10/12/2005 13:29:44;0080; pbs_mom;n/a;add_static;config[0] add name
max_load value 2.5
10/12/2005 13:29:44;0002; pbs_mom;Svr;ideal_load;2.0
10/12/2005 13:29:44;0080; pbs_mom;n/a;add_static;config[0] add name
ideal_load value 2.0
10/12/2005 13:29:44;0002; pbs_mom;n/a;initialize;independent
10/12/2005 13:29:44;0002; pbs_mom;Svr;pbs_mom;Is up
10/12/2005 13:29:44;0002; pbs_mom;n/a;is_update_stat;is_update_stat:
added connection to pbs
10/12/2005 13:29:44;0002; pbs_mom;n/a;is_update_stat;status update
successfully sent to server
10/12/2005 13:29:44;0002; pbs_mom;n/a;is_update_stat;hello sent to server
10/12/2005 13:29:44;0002; pbs_mom;n/a;is_update_stat;status update
successfully sent to server
10/12/2005 13:29:44;0002; pbs_mom;n/a;is_update_stat;hello sent to server
10/12/2005 13:29:44;0001; pbs_mom;Job;state_to_server;sent updated
state 0x0 to server
10/12/2005 13:29:54;0008; pbs_mom;Job;do_rpp;got an inter-server request
10/12/2005 13:29:54;0001; pbs_mom;Job;is_request;stream 0 version 1
10/12/2005 13:29:54;0001; pbs_mom;Job;is_request;command 2 received
10/12/2005 13:29:54;0001; pbs_mom;Job;is_request;IS_CLUSTER_ADDRS received
10/12/2005 13:29:54;0001; pbs_mom;Job;is_request;is_request:
10.1.1.34 added to okclients
10/12/2005 13:29:54;0001; pbs_mom;Job;is_request;is_request:
10.1.1.35 added to okclients
10/12/2005 13:29:54;0008; pbs_mom;Job;do_rpp;got an inter-server request
10/12/2005 13:29:54;0001; pbs_mom;Job;is_request;stream 0 version 1
10/12/2005 13:29:54;0001; pbs_mom;Job;is_request;command 2 received
10/12/2005 13:29:54;0001; pbs_mom;Job;is_request;IS_CLUSTER_ADDRS received
10/12/2005 13:29:54;0001; pbs_mom;Job;is_request;is_request:
10.1.1.34 added to okclients
10/12/2005 13:29:54;0001; pbs_mom;Job;is_request;is_request:
10.1.1.35 added to okclients
10/12/2005 13:29:54;0008; pbs_mom;Job;do_rpp;got an inter-server request
10/12/2005 13:29:54;0001; pbs_mom;Job;is_request;stream 0 version 1
10/12/2005 13:29:54;0001; pbs_mom;Job;is_request;command 2 received
10/12/2005 13:29:54;0001; pbs_mom;Job;is_request;IS_CLUSTER_ADDRS received
10/12/2005 13:29:54;0001; pbs_mom;Job;is_request;is_request:
10.1.1.34 added to okclients
10/12/2005 13:29:54;0001; pbs_mom;Job;is_request;is_request:
10.1.1.35 added to okclients
10/12/2005 13:30:14;0002; pbs_mom;n/a;is_update_stat;status update
successfully sent to server
10/12/2005 13:30:14;0002; pbs_mom;n/a;is_update_stat;status update
successfully sent to server
10/12/2005 13:30:36;0002; pbs_mom;n/a;is_update_stat;status update
successfully sent to server
10/12/2005 13:30:36;0002; pbs_mom;n/a;is_update_stat;status update
successfully sent to server
10/12/2005 13:30:36;0008; pbs_mom;Job;process_request;request type
QueueJob from host r2cn09.hpc.net allowed
10/12/2005 13:30:36;0008; pbs_mom;Job;process_request;request type
JobScript from host r2cn09.hpc.net allowed
10/12/2005 13:30:36;0008; pbs_mom;Job;process_request;request type
ReadyToCommit from host r2cn09.hpc.net allowed
10/12/2005 13:30:36;0008; pbs_mom;Job;process_request;request type
Commit from host r2cn09.hpc.net allowed
10/12/2005 13:30:36;0001; pbs_mom;Job;job_nodes;0: r2cn10/0
10/12/2005 13:30:36;0001; pbs_mom;Job;job_nodes;job: 0.r2cn09
numnodes=1 numvnod=1
10/12/2005 13:31:48;0008; pbs_mom;Job;0.r2cn09;evaluating limits for job
10/12/2005 13:31:48;0001; pbs_mom;Job;0.r2cn09;phase 2 of job launch
successfully completed
10/12/2005 13:31:48;0001; pbs_mom;Job;TMomFinalizeJob3;read start
return code=0 session=24804
10/12/2005 13:31:48;0001; pbs_mom;Job;TMomFinalizeJob3;job 0.r2cn09
started, pid = 24804
10/12/2005 13:31:48;0001; pbs_mom;Job;0.r2cn09;job successfully started
10/12/2005 13:31:48;0008; pbs_mom;Job;0.r2cn09;job 0.r2cn09 reported
successful start on 1 node(s)
10/12/2005 13:31:48;0002; pbs_mom;n/a;is_update_stat;status update
successfully sent to server
10/12/2005 13:31:48;0002; pbs_mom;n/a;is_update_stat;status update
successfully sent to server
10/12/2005 13:31:48;0008; pbs_mom;Req;send_sisters;sending command
POLL_JOB (7)
10/12/2005 13:32:08;0008; pbs_mom;Job;scan_for_terminated;for job
0.r2cn09, task 1, pid=24804, exitcode=0
10/12/2005 13:32:08;0080; pbs_mom;Job;0.r2cn09;scan_for_terminated:
job 0.r2cn09 task 1 terminated, sid 24804
10/12/2005 13:32:08;0008; pbs_mom;Job;0.r2cn09;Terminated
10/12/2005 13:32:08;0008; pbs_mom;Req;send_sisters;sending command
KILL_JOB (2)
10/12/2005 13:32:08;0080; pbs_mom;Job;0.r2cn09;local task termination
detected. killing job
10/12/2005 13:32:08;0008; pbs_mom;Job;0.r2cn09;kill_job
10/12/2005 13:32:08;0080; pbs_mom;Job;0.r2cn09;performing job clean-up
10/12/2005 13:32:08;0001; pbs_mom;Job;0.r2cn09;server rejected job
obit - unexpected job state
10/12/2005 13:32:08;0080; pbs_mom;Job;mom_deljob;deleting job 0.r2cn09
in state OBIT
10/12/2005 13:32:08;0008; pbs_mom;Job;scan_for_terminated;pid 24833
not tracked, exitcode=0
10/12/2005 13:32:18;0002; pbs_mom;n/a;is_update_stat;status update
successfully sent to server
10/12/2005 13:32:18;0002; pbs_mom;n/a;is_update_stat;status update
successfully sent to server
10/12/2005 13:32:27;0002; pbs_mom;Svr;pbs_mom;caught signal 15:
leaving jobs running, just exiting
10/12/2005 13:32:27;0002; pbs_mom;Svr;pbs_mom;Is down
10/12/2005 13:32:27;0002; pbs_mom;Svr;Log;Log closed
rsh works without password from each box to the other - although I know
I'm not staging anything yet.
Now the only thing I change in this whole setup to make it work is add
"--enable-debug" to the configure command when I build torque. I'm
using nfs installs so I try with a brand new box every time. It would
be really sweet to make this work without debug enabled so if there's
any other info I can provide, just let me know and it's on its way.
Thanks, Andrew
More information about the torqueusers
mailing list