[torqueusers] [Mauiusers] Maui-Torque integration problems

Jim Kusznir jkusznir at gmail.com
Wed Dec 9 18:49:19 MST 2009


I just completely replaced my torque install with a fresh build from
the RPM, which puts the server home dir in /opt/torque along with the
rest of torque.  I reconfigured torque from scratch as part of the
process, but still no go.  Here is a summary of all my configs:

torque was built with the included spec file, including the mods from
my last e-mail.  Final configure_args statement:
%define configure_args --disable-gcc-warnings --prefix=/opt/torque --with-server-home=/opt/torque --without-tcl

kusznir@isp-curran:/opt/torque> qmgr -c 'p s'
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = isp-curran
set server managers = kusznir@isp-curran.isp.wsu.edu
set server managers += maui@isp-curran.isp.wsu.edu
set server managers += root@isp-curran.isp.wsu.edu
set server operators = kusznir@isp-curran.isp.wsu.edu
set server operators += maui@isp-curran.isp.wsu.edu
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server keep_completed = 300
set server next_job_number = 6
kusznir@isp-curran:/opt/torque> qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
2.isp-curran              STDIN            kusznir                0 Q batch
3.isp-curran              STDIN            kusznir                0 Q batch
4.isp-curran              STDIN            kusznir                0 Q batch
5.isp-curran              STDIN            kusznir                0 Q batch
kusznir@isp-curran:/opt/torque> showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME


     0 Active Jobs       0 of  256 Processors Active (0.00%)
                         0 of    1 Nodes Active      (0.00%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


Total Jobs: 0   Active Jobs: 0   Idle Jobs: 0   Blocked Jobs: 0
kusznir@isp-curran:/opt/torque> diagnose -j 5
Name                  State Par Proc QOS     WCLimit R  Min     User    Group  Account  QueuedTime  Network  Opsys   Arch    Mem   Disk  Procs       Class Features


kusznir@isp-curran:/opt/torque> checkjob 5
ERROR:    'checkjob' failed
ERROR:  cannot locate job '5'

kusznir@isp-curran:/opt/maui> cat maui.cfg
# maui.cfg 3.2.6p20

SERVERHOST            isp-curran
# primary admin must be first in list
ADMIN1                maui root kusznir

# Resource Manager Definition

RMCFG[isp-curran] TYPE=PBS HOST=isp-curran.isp.wsu.edu

# Allocation Manager Definition

AMCFG[bank]  TYPE=NONE

# full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html
# use the 'schedctl -l' command to display current configuration

RMPOLLINTERVAL        00:00:30

SERVERPORT            42559
SERVERMODE            NORMAL

# Admin: http://supercluster.org/mauidocs/a.esecurity.html


LOGFILE               maui.log
LOGFILEMAXSIZE        10000000
LOGLEVEL              3

# Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html

QUEUETIMEWEIGHT       1

# Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html

# NONE SPECIFIED

# Backfill: http://supercluster.org/mauidocs/8.2backfill.html

BACKFILLPOLICY        FIRSTFIT
RESERVATIONPOLICY     CURRENTHIGHEST

# Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html

NODEALLOCATIONPOLICY  MINRESOURCE


The maui log from startup through the first scheduling loop is attached
as "sample.log".  Here is the torque server log from startup through
the present:

12/09/2009 17:02:10;0002;PBS_Server;Svr;Log;Log opened
12/09/2009 17:02:10;0006;PBS_Server;Svr;PBS_Server;Server isp-curran.isp.wsu.edu started, initialization type = 1
12/09/2009 17:02:10;0002;PBS_Server;Svr;Act;Account file /opt/torque/server_priv/accounting/20091209 opened
12/09/2009 17:02:10;0040;PBS_Server;Req;setup_nodes;setup_nodes()
12/09/2009 17:02:10;0086;PBS_Server;Svr;PBS_Server;Recovered queue batch
12/09/2009 17:02:10;0002;PBS_Server;Svr;PBS_Server;Expected 1, recovered 1 queues
12/09/2009 17:02:10;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs
12/09/2009 17:02:10;0006;PBS_Server;Svr;PBS_Server;Using ports Server:15001  Scheduler:15004  MOM:15002 (server: 'isp-curran.isp.wsu.edu')
12/09/2009 17:02:10;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 128752, loglevel=0
12/09/2009 17:02:10;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact node isp-curran
12/09/2009 17:02:15;0040;PBS_Server;Req;ping_nodes;ping attempting to contact 1 nodes
12/09/2009 17:02:15;0040;PBS_Server;Req;ping_nodes;successful ping to node isp-curran (stream 0)
12/09/2009 17:02:15;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.2, loglevel = 0
12/09/2009 17:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::stream_eof, connection to isp-curran is bad, remote service may be down, message may be corrupt, or connection may have been dropped remotely (Premature end of message).  setting node state to down
12/09/2009 17:07:15;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.2, loglevel = 0
12/09/2009 17:12:15;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.2, loglevel = 0
12/09/2009 17:17:15;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.2, loglevel = 0
12/09/2009 17:21:17;0100;PBS_Server;Job;2.isp-curran.isp.wsu.edu;enqueuing into batch, state 1 hop 1
12/09/2009 17:21:17;0008;PBS_Server;Job;2.isp-curran.isp.wsu.edu;Job Queued at request of kusznir@isp-curran.isp.wsu.edu, owner = kusznir@isp-curran.isp.wsu.edu, job name = STDIN, queue = batch
12/09/2009 17:21:17;0040;PBS_Server;Svr;isp-curran.isp.wsu.edu;Scheduler was sent the command scheduler_first
12/09/2009 17:21:53;0100;PBS_Server;Job;3.isp-curran.isp.wsu.edu;enqueuing into batch, state 1 hop 1
12/09/2009 17:21:53;0008;PBS_Server;Job;3.isp-curran.isp.wsu.edu;Job Queued at request of kusznir@isp-curran.isp.wsu.edu, owner = kusznir@isp-curran.isp.wsu.edu, job name = STDIN, queue = batch
12/09/2009 17:21:53;0040;PBS_Server;Svr;isp-curran.isp.wsu.edu;Scheduler was sent the command new
12/09/2009 17:21:55;0100;PBS_Server;Job;4.isp-curran.isp.wsu.edu;enqueuing into batch, state 1 hop 1
12/09/2009 17:21:55;0008;PBS_Server;Job;4.isp-curran.isp.wsu.edu;Job Queued at request of kusznir@isp-curran.isp.wsu.edu, owner = kusznir@isp-curran.isp.wsu.edu, job name = STDIN, queue = batch
12/09/2009 17:21:55;0040;PBS_Server;Svr;isp-curran.isp.wsu.edu;Scheduler was sent the command new
12/09/2009 17:21:56;0100;PBS_Server;Job;5.isp-curran.isp.wsu.edu;enqueuing into batch, state 1 hop 1
12/09/2009 17:21:56;0008;PBS_Server;Job;5.isp-curran.isp.wsu.edu;Job Queued at request of kusznir@isp-curran.isp.wsu.edu, owner = kusznir@isp-curran.isp.wsu.edu, job name = STDIN, queue = batch
12/09/2009 17:21:56;0040;PBS_Server;Svr;isp-curran.isp.wsu.edu;Scheduler was sent the command new
12/09/2009 17:22:15;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.2, loglevel = 0
12/09/2009 17:27:15;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.2, loglevel = 0
12/09/2009 17:31:56;0040;PBS_Server;Svr;isp-curran.isp.wsu.edu;Scheduler was sent the command time
12/09/2009 17:32:15;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.2, loglevel = 0
12/09/2009 17:37:15;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.2, loglevel = 0
12/09/2009 17:41:56;0040;PBS_Server;Svr;isp-curran.isp.wsu.edu;Scheduler was sent the command time
12/09/2009 17:42:15;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.2, loglevel = 0
---------------
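
In case it helps anyone skim a server log like the one above, here is a
rough Python sketch that pulls out the entries about node connectivity.
It is not part of torque or maui.  It assumes each entry has six
semicolon-separated fields as in the lines above, and that any line not
starting a new entry is a continuation of the previous message.  The
field names, function names and search phrases are my own labels taken
from this log, not anything documented.

```python
from typing import List, NamedTuple


class LogEntry(NamedTuple):
    # Field names are illustrative labels, not official Torque terminology.
    timestamp: str
    event: str
    daemon: str
    obj_class: str
    obj_name: str
    message: str


# Phrases copied from the log above; other Torque versions may word these differently.
NODE_PROBLEM_PHRASES = ("unable to contact node", "setting node state to down")


def parse_entries(text: str) -> List[LogEntry]:
    """Split log text into entries, rejoining messages that span several lines."""
    entries: List[LogEntry] = []
    for line in text.splitlines():
        parts = line.split(";", 5)
        if len(parts) == 6 and parts[0][:2].isdigit():
            # A new entry: date/time plus five more fields.
            entries.append(LogEntry(*(p.strip() for p in parts)))
        elif entries and line.strip():
            # Continuation text: append it to the previous entry's message.
            last = entries[-1]
            joined = (last.message + " " + line.strip()).strip()
            entries[-1] = last._replace(message=joined)
    return entries


def node_problems(entries: List[LogEntry]) -> List[LogEntry]:
    """Return the entries whose message mentions a node connectivity problem."""
    return [e for e in entries
            if any(phrase in e.message for phrase in NODE_PROBLEM_PHRASES)]


# A few entries copied from the log above, line breaks included.
SAMPLE = """\
12/09/2009 17:02:10;0004;PBS_Server;Svr;WARNING;ALERT: unable to
contact node isp-curran
12/09/2009 17:02:15;0040;PBS_Server;Req;ping_nodes;successful ping to
node isp-curran (stream 0)
12/09/2009 17:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::stream_eof,
connection to isp-curran is bad, remote service may be down, message
may be corrupt, or connection may have been dropped remotely
(Premature end of message).  setting node state to down
12/09/2009 17:21:17;0040;PBS_Server;Svr;isp-curran.isp.wsu.edu;Scheduler
was sent the command scheduler_first
"""

if __name__ == "__main__":
    for entry in node_problems(parse_entries(SAMPLE)):
        print(entry.timestamp, entry.message)
```

To run it against a real log, read the file and pass its contents to
parse_entries() in place of SAMPLE.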

I'm really dumbfounded by this problem; I've never encountered it
before.  I don't know how I can debug this any further without digging
into the source code, which I don't think I should have to do to run
a "standard" torque+maui configuration.  I'd really appreciate any
help with this.

Thanks!

--Jim
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sample.log
Type: text/x-log
Size: 116644 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20091209/36973920/attachment-0001.bin 