[Mauiusers] torque/maui integration - cannot set hostlist error

John Kitchin jkitchin at andrew.cmu.edu
Sun Dec 21 18:32:08 MST 2008

Hi everyone,

I am in the process of replacing PBSPro on our cluster with Torque/Maui. I
have installed the latest versions of Torque and Maui, and Torque appears to
run fine on its own and runs jobs. The installations seem to have gone well
according to the directions and tests. However, I have not been able to get
Maui to schedule jobs (after stopping pbs_sched and starting maui as user
jtest); they just remain in the queue in a deferred state.

Our basic setup is a login/submit node called beowulf (full name
beowulf.cheme.cmu.edu), where pbs_server and maui run, with the execute
nodes on an internal network.

Typical output of checkjob on a deferred job is:

job is deferred.  Reason:  RMFailure  (job cannot be started - cannot set
Holds:    Defer  (hold reason:  RMFailure)
PE:  1.00  StartPriority:  2
cannot select job 52 for partition DEFAULT (job hold active)

The Torque log indicates an error connecting to MOM:
12/21/2008 18:04:32;0008;PBS_Server;Job;52.beowulf;Job Modified at request of jtest at beowulf
12/21/2008 18:04:32;0001;PBS_Server;Req;;Server could not connect to MOM
12/21/2008 18:04:32;0080;PBS_Server;Req;req_reject;Reject reply code=15070(Server could not connect to MOM), aux=0, type=ModifyJob, from jtest at beowulf
12/21/2008 18:05:16;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.0b1, loglevel = 0
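
In case it helps anyone sift through a longer server log for the same
failure, here is a rough sketch of how the rejected requests could be pulled
out. It assumes each record has the six semicolon-separated fields seen in
the excerpt above (timestamp, event code, daemon, object type, object id,
message); that layout is read off the excerpt, not taken from a Torque
specification. The names parse_records, find_rejects and SAMPLE_LOG are made
up for this example.

```python
import re

# Assumption: a record starts with "MM/DD/YYYY HH:MM:SS;" as in the excerpt.
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")
# Assumption: reject messages look like "code=N(reason), ..., type=Request".
REJECT = re.compile(r"code=(\d+)\((.*?)\).*?type=(\w+)")
FIELDS = ("timestamp", "event", "daemon", "obj_type", "obj_id", "message")


def parse_records(text):
    """Re-join mail-wrapped lines, then split each record into named fields."""
    joined = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not joined:
            joined.append(line)          # a new record begins here
        else:
            joined[-1] += " " + line     # continuation of the previous record
    records = []
    for rec in joined:
        parts = rec.split(";", 5)        # the message itself may contain ';'
        if len(parts) == 6:
            records.append(dict(zip(FIELDS, parts)))
    return records


def find_rejects(records):
    """Return the reject code, reason and request type of each rejected request."""
    rejects = []
    for rec in records:
        m = REJECT.search(rec["message"])
        if m:
            rejects.append({
                "timestamp": rec["timestamp"],
                "code": int(m.group(1)),
                "reason": m.group(2),
                "request": m.group(3),
            })
    return rejects


# The first three records from the log excerpt, wrapped as they were mailed.
SAMPLE_LOG = """\
12/21/2008 18:04:32;0008;PBS_Server;Job;52.beowulf;Job Modified at request
of jtest at beowulf
12/21/2008 18:04:32;0001;PBS_Server;Req;;Server could not connect to MOM
12/21/2008 18:04:32;0080;PBS_Server;Req;req_reject;Reject reply
code=15070(Server could not connect to MOM), aux=0, type=ModifyJob, from
jtest at beowulf
"""

for r in find_rejects(parse_records(SAMPLE_LOG)):
    print(r["timestamp"], r["code"], r["request"], r["reason"])
```

On the excerpt this picks out the single ModifyJob request that was rejected
with code 15070.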

Maui is running as the user jtest, and jtest is a manager and operator in
Torque and is listed as ADMIN1 in Maui.

Some output from qmgr -c 'p s':

set server scheduling = True
set server acl_hosts = beowulf
set server managers = jtest at beowulf
set server operators = jtest at beowulf
set server default_queue = q_feed
set server log_events = 255
set server mail_from = ChemE-beowulf-PBS
set server query_other_jobs = True
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server comment = ChemE Beowulf Cluster
set server next_job_number = 53

Top of maui.cfg:
# maui.cfg 3.2.6p20
SERVERHOST            beowulf
# primary admin must be first in list
ADMIN1                jtest
# Resource Manager Definition

On the nodes, the MOM config files contain:
matsim (jtest) ~ > ssh c1n10 'cat /var/spool/torque/mom_priv/config'
$clienthost beowulf
$restricted *.cheme.cmu.edu
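
Since the server log complains that it could not connect to MOM, one thing
that could be checked from beowulf is whether a plain TCP connection to
pbs_mom on a node succeeds at all. A minimal sketch follows; it assumes
pbs_mom listens on port 15002, which I believe is the usual Torque default
but have not confirmed for this installation, and can_connect is a
hypothetical helper name.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the name and connects, or raises OSError
        # (this covers refused connections, timeouts and failed name lookups).
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, can_connect("c1n10", 15002) run on beowulf would show whether
the node is reachable on that port. It only tests reachability; it says
nothing about whether pbs_mom accepts requests from the server.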

Does anything stand out as wrong here? I have tried several variations of
the settings above, with no luck getting Maui to work. Any suggestions?
Thanks,


John Kitchin
Assistant Professor
NETL-IAES Resident Institute Fellow
Doherty Hall A207F
Department of Chemical Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
