[torqueusers] torque requires --enable-debug to work

Andrew Sharpe andrew.sharpe1 at jcu.edu.au
Tue Oct 11 21:50:35 MDT 2005


Hi all,

I'm fairly new to torque but I think this is non-standard behaviour.  
I've verified that this problem exists on
torque-1.2.0p4
torque-1.2.0p6
torque-1.2.0p7-snap.1127772314
using CentOS4.1 (full install) on x86_64.

The problem is that torque only works if I compile it with 
--enable-debug.  Here's the steps I follow to obtain the results:

NOTE: pbs is a CNAME to machine1, which correctly resolves to 10.1.1.12, 
machine2 correctly resolves to 10.1.1.13 - by resolves I mean forward 
and reverse lookups are ok.

1. compile any of the above versions using the following commands
./configure --prefix=/usr/local --set-default-server=pbs 
--set-server-home=/var/spool/PBS --enable-server --enable-docs 
--enable-mom --enable-clients --enable-syslog --disable-rpp
make
make install

2. configure the server
    a. add the client nodes to /var/spool/PBS/server_priv/nodes
    b. set the default server in /var/spool/PBS/server_name
    c. configure a queue (taken from 
http://www.clusterresources.com/products/torque/docs/manualconfig.shtml)
        pbs_server -t create
        qmgr -c "set server scheduling=true"
        qmgr -c "create queue batch queue_type=execution"
        qmgr -c "set queue batch started=true"
        qmgr -c "set queue batch enabled=true"
        qmgr -c "set queue batch resources_default.nodes=1"
        qmgr -c "set queue batch resources_default.walltime=3600"
        qmgr -c "set server default_queue=batch"
       
     d. stop the server
        qterm -t quick

3. configure the client(s)
    a. set the default server (for client commands) in 
/var/spool/PBS/server_name
    b. configure mom (in /var/spool/PBS/mom_priv/config
       $logevent 255
       $loglevel 4
       $pbsserver pbs
       $clienthost 10.1.1.13
       $cputmult 1.0
       $wallmult 1.0
       $max_load 2.5
       $ideal_load 2.0


4. start the server - note that I don't use a scheduler here, the 
problem occurs whether using pbs_sched, maui or qrun
    /usr/local/sbin/pbs_server
    /usr/local/sbin/pbs_mom

5. start the client
    /usr/local/sbin/pbs_mom


Now for some diags:

Both machines are using ntp, and here's the output of ntpq -p for each 
machine

computer1
     remote           refid      st t when poll reach   delay   offset  
jitter
==============================================================================
*ntp             LOCAL(0)        11 u  903 1024  377    0.105   -0.098   
0.083


computer2
     remote           refid      st t when poll reach   delay   offset  
jitter
==============================================================================
*ntp             LOCAL(0)        11 u  499 1024  377    0.116    0.283   
0.008



On the server, run pbsnodes -a

computer1
     state = free
     np = 1
     ntype = cluster
     status = arch=linux,uname=Linux computer1 2.6.9-11.ELsmp #1 SMP Wed 
Jun 8 16:59:12 CDT 2005 x86_64,sessions=1329 2261 
26560,nsessions=3,nusers=1,idletime=2032,totme
m=5111944kb,availmem=4927996kb,physmem=4059696kb,ncpus=2,loadave=0.00,netload=17951093,state=free,rectime=1129082554

computer2
     state = free
     np = 1
     ntype = cluster
     status = arch=linux,uname=Linux computer2 2.6.9-11.ELsmp #1 SMP Wed 
Jun 8 16:59:12 CDT 2005 x86_64,sessions=1269 
1995,nsessions=2,nusers=1,idletime=2033,totmem=5111
944kb,availmem=4931696kb,physmem=4059696kb,ncpus=2,loadave=0.00,netload=17553961,state=free,rectime=1129082545



Now as a normal user on computer1, submit a job

echo "sleep 20" | qsub -r n

Check the queue

qstat

computer 1
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
0.computer1      STDIN            hpc                     0 Q 
batch          


qstat -an
computer 1
                                                            Req'd  
Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  
S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- 
- -----
0.computer1     hpc      batch    STDIN         --   --  --    --  01:00 
Q   --
    --


As we aren't using a scheduler at the moment (to limit the search 
space), run the job as root on computer1

qrun -H computer2 0

qrun: Execution server rejected request MSG=send failed, STARTING 
0.computer1


Here's another look at the queue after the above sequence of events

qstat -an
computer1
                                                            Req'd  
Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  
S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- 
- -----
0.computer1     hpc      batch    STDIN         --   --  --    --  01:00 
Q   --
   computer2/0


So the job appears to not be running, however the wierdest bit is that 
the job actually does run about a minute later, so here's the logs of 
pbs_server on computer1 and mom on computer2

pbs_server on computer1
==================
10/12/2005 13:29:50;0002;PBS_Server;Svr;Log;Log opened
10/12/2005 13:29:50;0006;PBS_Server;Svr;PBS_Server;Server r2cn09 
started, initialization type = 1
10/12/2005 13:29:50;0002;PBS_Server;Svr;Act;Account file 
/var/spool/PBS/server_priv/accounting/20051012 opened
10/12/2005 13:29:50;0040;PBS_Server;Req;setup_nodes;setup_nodes()

10/12/2005 13:29:50;0086;PBS_Server;Svr;PBS_Server;Recovered queue batch
10/12/2005 13:29:50;0002;PBS_Server;Svr;PBS_Server;Expected 1, recovered 
1 queues
10/12/2005 13:29:50;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 
0 jobs
10/12/2005 13:29:50;0006;PBS_Server;Svr;PBS_Server;Using ports 
Server:15001  Scheduler:15004  MOM:15002
10/12/2005 13:29:50;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 
26626, loglevel=0
10/12/2005 13:29:50;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact 
node r2cn09
10/12/2005 13:29:50;0001;PBS_Server;Svr;PBS_Server;Connection refused 
(111) in contact_sched, Could not contact Scheduler - port 15004
10/12/2005 13:29:54;0040;PBS_Server;Req;is_stat_get;node r2cn10 marked 
available
10/12/2005 13:29:56;0100;PBS_Server;Req;;Type AuthenticateUser request 
received from root at r2cn09, sock=11
10/12/2005 13:29:56;0100;PBS_Server;Req;;Type StatusNode request 
received from root at r2cn09, sock=10
10/12/2005 13:30:00;0040;PBS_Server;Req;is_stat_get;node r2cn09 marked 
available
10/12/2005 13:30:02;0100;PBS_Server;Req;;Type AuthenticateUser request 
received from root at r2cn09, sock=11
10/12/2005 13:30:02;0100;PBS_Server;Req;;Type StatusNode request 
received from root at r2cn09, sock=10
10/12/2005 13:30:05;0100;PBS_Server;Req;;Type AuthenticateUser request 
received from hpc at r2cn09, sock=11
10/12/2005 13:30:05;0100;PBS_Server;Req;;Type QueueJob request received 
from hpc at r2cn09, sock=10
10/12/2005 13:30:05;0100;PBS_Server;Req;;Type JobScript request received 
from hpc at r2cn09, sock=10
10/12/2005 13:30:05;0100;PBS_Server;Req;;Type ReadyToCommit request 
received from hpc at r2cn09, sock=10
10/12/2005 13:30:05;0100;PBS_Server;Req;;Type Commit request received 
from hpc at r2cn09, sock=10
10/12/2005 13:30:05;0100;PBS_Server;Job;0.r2cn09;enqueuing into batch, 
state 1 hop 1
10/12/2005 13:30:05;0008;PBS_Server;Job;0.r2cn09;Job Queued at request 
of hpc at r2cn09, owner = hpc at r2cn09, job name = STDIN, queue = batch
10/12/2005 13:30:05;0001;PBS_Server;Svr;PBS_Server;Connection refused 
(111) in contact_sched, Could not contact Scheduler - port 15004
10/12/2005 13:30:07;0100;PBS_Server;Req;;Type AuthenticateUser request 
received from hpc at r2cn09, sock=11
10/12/2005 13:30:07;0100;PBS_Server;Req;;Type StatusJob request received 
from hpc at r2cn09, sock=10
10/12/2005 13:30:09;0100;PBS_Server;Req;;Type AuthenticateUser request 
received from hpc at r2cn09, sock=11
10/12/2005 13:30:09;0100;PBS_Server;Req;;Type StatusServer request 
received from hpc at r2cn09, sock=10
10/12/2005 13:30:09;0100;PBS_Server;Req;;Type StatusJob request received 
from hpc at r2cn09, sock=10
10/12/2005 13:30:36;0100;PBS_Server;Req;;Type AuthenticateUser request 
received from root at r2cn09, sock=11
10/12/2005 13:30:36;0100;PBS_Server;Req;;Type RunJob request received 
from root at r2cn09, sock=10
10/12/2005 13:30:36;0008;PBS_Server;Job;0.r2cn09;Job Run at request of 
root at r2cn09
10/12/2005 13:30:42;0001;PBS_Server;Svr;PBS_Server;Operation now in 
progress (115) in send_job, send_job commit failed, rc=15031 (Premature 
end of message)
10/12/2005 13:30:42;0008;PBS_Server;Job;0.r2cn09;unable to run job, MOM 
rejected/rc=1
10/12/2005 13:30:42;0080;PBS_Server;Req;req_reject;Reject reply 
code=15041(Execution server rejected request MSG=send failed, STARTING), 
aux=0, type=RunJob, from root at r2
cn09
10/12/2005 13:30:42;0001;PBS_Server;Svr;PBS_Server;Connection refused 
(111) in contact_sched, Could not contact Scheduler - port 15004
10/12/2005 13:31:35;0100;PBS_Server;Req;;Type AuthenticateUser request 
received from hpc at r2cn09, sock=11
10/12/2005 13:31:35;0100;PBS_Server;Req;;Type StatusServer request 
received from hpc at r2cn09, sock=10
10/12/2005 13:31:35;0100;PBS_Server;Req;;Type StatusJob request received 
from hpc at r2cn09, sock=10
10/12/2005 13:32:08;0100;PBS_Server;Req;;Type JobObituary request 
received from pbs_mom at r2cn10.hpc.net, sock=10
10/12/2005 13:32:08;0009;PBS_Server;Job;0.r2cn09;obit received for job 
0.r2cn09 from host r2cn10.hpc.net with bad state (state: QUEUED)
10/12/2005 13:32:08;0080;PBS_Server;Req;req_reject;Reject reply 
code=15016(Request invalid for state of job), aux=0, type=JobObituary, 
from pbs_mom at r2cn10.hpc.net
10/12/2005 13:32:31;0086;PBS_Server;Svr;PBS_Server;Starting to shutdown 
the server, type is By Signal
10/12/2005 13:32:31;0002;PBS_Server;Svr;PBS_Server;Server shutdown completed
10/12/2005 13:32:31;0002;PBS_Server;Svr;Log;Log closed


mom on computer2
==============
10/12/2005 13:29:44;0002;   pbs_mom;Svr;Log;Log opened
10/12/2005 13:29:44;0002;   pbs_mom;Svr;setloglevel;4
10/12/2005 13:29:44;0002;   pbs_mom;Svr;restricted;*.hpc.net
10/12/2005 13:29:44;0002;   pbs_mom;Svr;cputmult;1.0
10/12/2005 13:29:44;0002;   pbs_mom;Svr;wallmult;1.0
10/12/2005 13:29:44;0002;   pbs_mom;Svr;max_load;2.5
10/12/2005 13:29:44;0080;   pbs_mom;n/a;add_static;config[0] add name 
max_load value 2.5
10/12/2005 13:29:44;0002;   pbs_mom;Svr;ideal_load;2.0
10/12/2005 13:29:44;0080;   pbs_mom;n/a;add_static;config[0] add name 
ideal_load value 2.0
10/12/2005 13:29:44;0002;   pbs_mom;n/a;initialize;independent
10/12/2005 13:29:44;0002;   pbs_mom;Svr;pbs_mom;Is up
10/12/2005 13:29:44;0002;   pbs_mom;n/a;is_update_stat;is_update_stat: 
added connection to pbs
10/12/2005 13:29:44;0002;   pbs_mom;n/a;is_update_stat;status update 
successfully sent to server
10/12/2005 13:29:44;0002;   pbs_mom;n/a;is_update_stat;hello sent to server
10/12/2005 13:29:44;0002;   pbs_mom;n/a;is_update_stat;status update 
successfully sent to server
10/12/2005 13:29:44;0002;   pbs_mom;n/a;is_update_stat;hello sent to server
10/12/2005 13:29:44;0001;   pbs_mom;Job;state_to_server;sent updated 
state 0x0 to server

10/12/2005 13:29:54;0008;   pbs_mom;Job;do_rpp;got an inter-server request
10/12/2005 13:29:54;0001;   pbs_mom;Job;is_request;stream 0 version 1

10/12/2005 13:29:54;0001;   pbs_mom;Job;is_request;command 2 received
10/12/2005 13:29:54;0001;   pbs_mom;Job;is_request;IS_CLUSTER_ADDRS received
10/12/2005 13:29:54;0001;   pbs_mom;Job;is_request;is_request:  
10.1.1.34 added to okclients

10/12/2005 13:29:54;0001;   pbs_mom;Job;is_request;is_request:  
10.1.1.35 added to okclients

10/12/2005 13:29:54;0008;   pbs_mom;Job;do_rpp;got an inter-server request
10/12/2005 13:29:54;0001;   pbs_mom;Job;is_request;stream 0 version 1

10/12/2005 13:29:54;0001;   pbs_mom;Job;is_request;command 2 received
10/12/2005 13:29:54;0001;   pbs_mom;Job;is_request;IS_CLUSTER_ADDRS received
10/12/2005 13:29:54;0001;   pbs_mom;Job;is_request;is_request:  
10.1.1.34 added to okclients

10/12/2005 13:29:54;0001;   pbs_mom;Job;is_request;is_request:  
10.1.1.35 added to okclients

10/12/2005 13:29:54;0008;   pbs_mom;Job;do_rpp;got an inter-server request
10/12/2005 13:29:54;0001;   pbs_mom;Job;is_request;stream 0 version 1

10/12/2005 13:29:54;0001;   pbs_mom;Job;is_request;command 2 received
10/12/2005 13:29:54;0001;   pbs_mom;Job;is_request;IS_CLUSTER_ADDRS received
10/12/2005 13:29:54;0001;   pbs_mom;Job;is_request;is_request:  
10.1.1.34 added to okclients

10/12/2005 13:29:54;0001;   pbs_mom;Job;is_request;is_request:  
10.1.1.35 added to okclients

10/12/2005 13:30:14;0002;   pbs_mom;n/a;is_update_stat;status update 
successfully sent to server
10/12/2005 13:30:14;0002;   pbs_mom;n/a;is_update_stat;status update 
successfully sent to server
10/12/2005 13:30:36;0002;   pbs_mom;n/a;is_update_stat;status update 
successfully sent to server
10/12/2005 13:30:36;0002;   pbs_mom;n/a;is_update_stat;status update 
successfully sent to server
10/12/2005 13:30:36;0008;   pbs_mom;Job;process_request;request type 
QueueJob from host r2cn09.hpc.net allowed
10/12/2005 13:30:36;0008;   pbs_mom;Job;process_request;request type 
JobScript from host r2cn09.hpc.net allowed
10/12/2005 13:30:36;0008;   pbs_mom;Job;process_request;request type 
ReadyToCommit from host r2cn09.hpc.net allowed
10/12/2005 13:30:36;0008;   pbs_mom;Job;process_request;request type 
Commit from host r2cn09.hpc.net allowed
10/12/2005 13:30:36;0001;   pbs_mom;Job;job_nodes;0: r2cn10/0
10/12/2005 13:30:36;0001;   pbs_mom;Job;job_nodes;job: 0.r2cn09 
numnodes=1 numvnod=1

10/12/2005 13:31:48;0008;   pbs_mom;Job;0.r2cn09;evaluating limits for job
10/12/2005 13:31:48;0001;   pbs_mom;Job;0.r2cn09;phase 2 of job launch 
successfully completed
10/12/2005 13:31:48;0001;   pbs_mom;Job;TMomFinalizeJob3;read start 
return code=0 session=24804

10/12/2005 13:31:48;0001;   pbs_mom;Job;TMomFinalizeJob3;job 0.r2cn09 
started, pid = 24804
10/12/2005 13:31:48;0001;   pbs_mom;Job;0.r2cn09;job successfully started
10/12/2005 13:31:48;0008;   pbs_mom;Job;0.r2cn09;job 0.r2cn09 reported 
successful start on 1 node(s)
10/12/2005 13:31:48;0002;   pbs_mom;n/a;is_update_stat;status update 
successfully sent to server
10/12/2005 13:31:48;0002;   pbs_mom;n/a;is_update_stat;status update 
successfully sent to server
10/12/2005 13:31:48;0008;   pbs_mom;Req;send_sisters;sending command 
POLL_JOB (7)
10/12/2005 13:32:08;0008;   pbs_mom;Job;scan_for_terminated;for job 
0.r2cn09, task 1, pid=24804, exitcode=0

10/12/2005 13:32:08;0080;   pbs_mom;Job;0.r2cn09;scan_for_terminated: 
job 0.r2cn09 task 1 terminated, sid 24804
10/12/2005 13:32:08;0008;   pbs_mom;Job;0.r2cn09;Terminated
10/12/2005 13:32:08;0008;   pbs_mom;Req;send_sisters;sending command 
KILL_JOB (2)
10/12/2005 13:32:08;0080;   pbs_mom;Job;0.r2cn09;local task termination 
detected.  killing job
10/12/2005 13:32:08;0008;   pbs_mom;Job;0.r2cn09;kill_job
10/12/2005 13:32:08;0080;   pbs_mom;Job;0.r2cn09;performing job clean-up
10/12/2005 13:32:08;0001;   pbs_mom;Job;0.r2cn09;server rejected job 
obit - unexpected job state
10/12/2005 13:32:08;0080;   pbs_mom;Job;mom_deljob;deleting job 0.r2cn09 
in state OBIT
10/12/2005 13:32:08;0008;   pbs_mom;Job;scan_for_terminated;pid 24833 
not tracked, exitcode=0

10/12/2005 13:32:18;0002;   pbs_mom;n/a;is_update_stat;status update 
successfully sent to server
10/12/2005 13:32:18;0002;   pbs_mom;n/a;is_update_stat;status update 
successfully sent to server
10/12/2005 13:32:27;0002;   pbs_mom;Svr;pbs_mom;caught signal 15: 
leaving jobs running, just exiting
10/12/2005 13:32:27;0002;   pbs_mom;Svr;pbs_mom;Is down
10/12/2005 13:32:27;0002;   pbs_mom;Svr;Log;Log closed


rsh works without password from each box to the other - although I know 
I'm not staging anything yet.

Now the only thing I change in this whole setup to make it work is add 
"--enable-debug" to the configure command when I build torque.  I'm 
using nfs installs so I try with a brand new box every time.  It would 
be really sweet to make this work without debug enabled so if there's 
any other info I can provide, just let me know and it's on its way.

Thanks, Andrew




More information about the torqueusers mailing list