[torqueusers] Problem starting jobs with multiple nodes.
John Hanks
griznog at gmail.com
Wed Oct 8 08:29:39 MDT 2008
Hi,
I'm setting up a small cluster and have hit a snag with torque. I can
submit and run jobs that use a single node without any problems, but
jobs that use more than one node bounce back and forth between Q and R
states and never start. The errors I see are:
Oct 8 10:22:29 node-0012 pbs_mom: Success (0) in init_groups, pre-sigprocmask
Oct 8 10:22:29 node-0012 pbs_mom: Success (0) in init_groups, post-initgroups
Oct 8 10:22:29 node-0012 pbs_mom: Bad UID for job execution (15023) in 23.batman.broad.mit.edu, job_start_error from node 10.128.1.11:15003 in job_start_error
Oct 8 10:22:29 node-0012 pbs_mom: Bad UID for job execution (15023) in 23.batman.broad.mit.edu, abort attempted 16 times in job_start_error. ignoring abort request from node 10.128.1.11:15003
Oct 8 10:22:29 node-0012 pbs_mom: Job with requested ID already exists (15009) in im_request, KILL/ABORT request for job 23.batman.broad.mit.edu returned error 15009
These messages repeat until I kill the job.
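Because the same few messages repeat many times, it can help to tally them by node and error code. Below is a minimal Python sketch for doing that. It assumes the syslog layout shown above ("Mon D HH:MM:SS host pbs_mom: message (code) ..."), and it re-joins entries that have been wrapped across lines. The names join_wrapped and tally_errors are my own for illustration and are not part of torque.

```python
import re
from collections import Counter

# Assumed syslog prefix, e.g. "Oct 8 10:22:29 node-0012 pbs_mom: "
ENTRY_START = re.compile(
    r"^[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2} (\S+) pbs_mom: "
)
# First "message (code)" pair at the start of an entry body
MESSAGE_CODE = re.compile(r"^(.*?) \((\d+)\)")


def join_wrapped(lines):
    """Re-join log entries that were wrapped across several lines."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            # A new entry begins with the syslog prefix
            entries.append(line)
        else:
            # Anything else is treated as a continuation of the previous entry
            entries[-1] += " " + line
    return entries


def tally_errors(lines):
    """Count (node, message, code) triples, skipping code 0 (Success)."""
    counts = Counter()
    for entry in join_wrapped(lines):
        start = ENTRY_START.match(entry)
        if not start:
            continue
        found = MESSAGE_CODE.match(entry[start.end():])
        if found and found.group(2) != "0":
            key = (start.group(1), found.group(1), int(found.group(2)))
            counts[key] += 1
    return counts
```

Passing the lines of the syslog file to tally_errors returns a Counter keyed by (node, message, code), so the most frequent errors on each node can be listed with its most_common() method.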
My torque config is:
Qmgr: p s
#
# Create queues and set their attributes.
#
#
# Create and define queue mpiblast
#
create queue mpiblast
set queue mpiblast queue_type = Execution
set queue mpiblast resources_max.walltime = 24:00:00
set queue mpiblast resources_default.walltime = 01:00:00
set queue mpiblast enabled = True
set queue mpiblast started = True
#
# Create and define queue route
#
create queue route
set queue route queue_type = Route
set queue route route_destinations = mpiblast
set queue route enabled = True
set queue route started = True
#
# Set server attributes.
#
set server scheduling = True
set server default_queue = route
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 300
set server node_check_rate = 150
set server tcp_timeout = 6
set server log_level = 1
set server pbs_version = 2.2.1
and maui.cfg is:
SERVERHOST batman.broad.mit.edu
SERVERMODE NORMAL
ADMIN1 root
RMCFG[batman] TYPE=PBS
AMCFG[bank] TYPE=NONE
RMPOLLINTERVAL 00:00:30
SERVERPORT 42559
SERVERMODE NORMAL
LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 7
QUEUETIMEWEIGHT 1
BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST
NODEALLOCATIONPOLICY MINRESOURCE
REMAPCLASS route
REMAPCLASSLIST mpiblast
Any insight into what I'm doing wrong would be appreciated.
Thanks,
jbh