[torqueusers] Problem starting jobs with multiple nodes.

John Hanks griznog at gmail.com
Wed Oct 8 08:29:39 MDT 2008


Hi,

I'm setting up a small cluster and have hit a snag with torque. I can
submit and run jobs that use a single node without any problems, but
jobs that use more than one node bounce back and forth between Q and R
states and never start. The errors I see are:

Oct  8 10:22:29 node-0012 pbs_mom: Success (0) in init_groups, pre-sigprocmask
Oct  8 10:22:29 node-0012 pbs_mom: Success (0) in init_groups, post-initgroups
Oct  8 10:22:29 node-0012 pbs_mom: Bad UID for job execution (15023) in 23.batman.broad.mit.edu, job_start_error from node 10.128.1.11:15003 in job_start_error
Oct  8 10:22:29 node-0012 pbs_mom: Bad UID for job execution (15023) in 23.batman.broad.mit.edu, abort attempted 16 times in job_start_error.  ignoring abort request from node 10.128.1.11:15003
Oct  8 10:22:29 node-0012 pbs_mom: Job with requested ID already exists (15009) in im_request, KILL/ABORT request for job 23.batman.broad.mit.edu returned error 15009

Repeated until I kill the job.
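
Since the same few messages repeat many times, a tally by error code
makes the pattern easier to see. The following is only a minimal sketch,
not part of torque: the function name tally_mom_errors and the regular
expression are my own assumptions about the "message (code)" layout seen
in the lines above, and it expects each log entry on a single line (mail
clients may wrap them).

```python
import re
from collections import Counter

# Assumed layout, taken from the log excerpt: "pbs_mom: <message> (<code>) ..."
PATTERN = re.compile(r"pbs_mom: (?P<msg>.+?) \((?P<code>\d+)\)")


def tally_mom_errors(lines):
    """Count pbs_mom messages by (code, message), skipping code 0 (Success)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match and match.group("code") != "0":
            counts[(int(match.group("code")), match.group("msg"))] += 1
    return counts


# Sample lines copied from the excerpt above, one entry per line.
sample = [
    "Oct  8 10:22:29 node-0012 pbs_mom: Success (0) in init_groups, pre-sigprocmask",
    "Oct  8 10:22:29 node-0012 pbs_mom: Bad UID for job execution (15023) in "
    "23.batman.broad.mit.edu, job_start_error from node 10.128.1.11:15003 "
    "in job_start_error",
    "Oct  8 10:22:29 node-0012 pbs_mom: Job with requested ID already exists "
    "(15009) in im_request, KILL/ABORT request for job "
    "23.batman.broad.mit.edu returned error 15009",
]

for (code, msg), count in sorted(tally_mom_errors(sample).items()):
    print(code, msg, count)
```

To run it against a real log, pass the open file object instead of the
sample list; the path to the syslog file depends on the node's setup.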

My torque config is:

Qmgr: p s
#
# Create queues and set their attributes.
#
#
# Create and define queue mpiblast
#
create queue mpiblast
set queue mpiblast queue_type = Execution
set queue mpiblast resources_max.walltime = 24:00:00
set queue mpiblast resources_default.walltime = 01:00:00
set queue mpiblast enabled = True
set queue mpiblast started = True
#
# Create and define queue route
#
create queue route
set queue route queue_type = Route
set queue route route_destinations = mpiblast
set queue route enabled = True
set queue route started = True
#
# Set server attributes.
#
set server scheduling = True
set server default_queue = route
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 300
set server node_check_rate = 150
set server tcp_timeout = 6
set server log_level = 1
set server pbs_version = 2.2.1

and maui.cfg is:

SERVERHOST            batman.broad.mit.edu
SERVERMODE            NORMAL
ADMIN1                root
RMCFG[batman] TYPE=PBS
AMCFG[bank]  TYPE=NONE
RMPOLLINTERVAL        00:00:30
SERVERPORT            42559
SERVERMODE            NORMAL
LOGFILE               maui.log
LOGFILEMAXSIZE        10000000
LOGLEVEL              7
QUEUETIMEWEIGHT       1
BACKFILLPOLICY        FIRSTFIT
RESERVATIONPOLICY     CURRENTHIGHEST
NODEALLOCATIONPOLICY  MINRESOURCE
REMAPCLASS           route
REMAPCLASSLIST       mpiblast

Any insight into what I'm doing wrong would be appreciated.

Thanks,

jbh

