[torqueusers] torque problem: submitting jobs from nodes

Jim Kusznir jkusznir at gmail.com
Tue Sep 6 18:11:27 MDT 2011

Hi All:

I've got a user who's trying to have his jobs checkpoint and re-queue
themselves at the end of their runtime so that they can run with
shorter walltime limits (and thus help balance cluster usage, fair
share, etc.).  Of course, for this to work, he needs to be able to
submit jobs (qsub) from the compute nodes.  I figured this should be
no big deal, and checked my qmgr settings:

Qmgr: print server
# Create queues and set their attributes.
# Create and define queue default
create queue default
set queue default queue_type = Execution
set queue default resources_max.walltime = 24:00:00
set queue default resources_default.nodes = 1
set queue default resources_default.walltime = 01:00:00
set queue default enabled = True
set queue default started = True
# Create and define queue long
create queue long
set queue long queue_type = Execution
set queue long enabled = True
set queue long started = True
# Set server attributes.
set server scheduling = True
set server acl_host_enable = False
set server acl_user_enable = False
set server managers = kusznir@aeolus.wsu.edu
set server managers += maui@aeolus.wsu.edu
set server managers += root@aeolus.wsu.edu
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_available.nodect = 288
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 304175

Unfortunately, when one tries to submit a job from a compute node, one gets:
[kusznir@compute-0-20 ~]$ qsub -I -l nodes=1
qsub: Bad UID for job execution MSG=ruserok failed validating kusznir/kusznir from compute-0-20.local

What's going on here?  As far as I can tell, the settings above should
allow this to work.  What's wrong?
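
For context, the self-resubmission pattern the user is after looks
roughly like the sketch below.  This is a minimal illustration, not
his actual script: the function name resubmit and the variables
RUN_INDEX, MAX_RUNS, WALLTIME, JOB_SCRIPT and QSUB are all made up for
this example, and the checkpoint step itself is application-specific
and left out.  The only Torque pieces used are qsub's -l walltime=
and -v options.

```shell
#!/bin/sh
# Hypothetical sketch of a job that re-queues itself from a compute node.
# All names below are assumptions for illustration only.

QSUB="${QSUB:-qsub}"              # submit command; overridable for dry runs
JOB_SCRIPT="${JOB_SCRIPT:-$0}"    # script to resubmit (defaults to this one)
WALLTIME="${WALLTIME:-01:00:00}"  # walltime requested for each chunk
MAX_RUNS="${MAX_RUNS:-5}"         # safety cap so the chain cannot run forever
RUN_INDEX="${RUN_INDEX:-1}"       # which chunk this is; passed along via -v

# Submit the next chunk unless the cap has been reached.
resubmit() {
    next=$((RUN_INDEX + 1))
    if [ "$next" -gt "$MAX_RUNS" ]; then
        echo "reached MAX_RUNS=$MAX_RUNS, not resubmitting"
        return 0
    fi
    "$QSUB" -l "walltime=$WALLTIME" -v "RUN_INDEX=$next" "$JOB_SCRIPT"
}

# In a real job script the application would run and write its
# checkpoint here, and resubmit would be called once that has finished.
```

The resubmit call is the step that runs qsub on the compute node, which
is exactly where the "Bad UID for job execution" error above shows up.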

