[torqueusers] Cannot submit Job to default batch queue
Vlad Popa
vlad at cosy.sbg.ac.at
Wed Aug 24 12:45:07 MDT 2011
Hi!
We are running Torque 3.0.2 with Maui 3.3-6 as the scheduler on a PUIAS
Linux 6.1 64-bit distribution (equivalent to RHEL 6.x, installed from
distro RPMs).
We have 7 GPU nodes, gpu01-07, and 1 GPU master/portal host (hostname
gpu) from which jobs should be submitted. So pbs_server is running on gpu
and pbs_mom on nodes gpu01-07. The home directory is exported via NFS from
gpu to the nodes, authentication is NIS-based, and passwordless ssh login
works on all nodes.
pbsnodes -a shows all 7 nodes up and state free.
...
gpu01
     state = free
     np = 6
     properties = i7
     ntype = cluster
     status = rectime=1314217126,varattr=,jobs=,state=free,netload=269864473916,gres=,loadave=2.91,ncpus=8,physmem=16315316kb,availmem=47091716kb,totmem=49083308kb,idletime=615,nusers=1,nsessions=1,sessions=28415,uname=Linux gpu01 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 2
gpu02
     state = free
     np = 6
     properties = i7
     ntype = cluster
     status = rectime=1314217134,varattr=,jobs=,state=free,netload=330061706733,gres=,loadave=0.00,ncpus=8,physmem=12187556kb,availmem=44198668kb,totmem=44955548kb,idletime=1158667,nusers=0,nsessions=0,uname=Linux gpu02 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011 x86_64,opsys=linux
     mom_service_port = 15002
..
etc etc ..
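A compact way to confirm the "all nodes free" observation is to reduce the pbsnodes listing to one line per node. This is only a minimal sketch: the function name summarize_nodes is made up for illustration, and it assumes the usual pbsnodes layout in which node names start in the first column and attribute lines are indented.

```shell
# Summarize node name and state from `pbsnodes -a` output read on stdin.
# Assumption: node names start in column 0 and attribute lines are indented,
# which is how pbsnodes normally prints them.
summarize_nodes() {
    awk '
        /^[^[:space:]]/         { node = $1 }          # unindented line: node name
        /^[[:space:]]+state = / { print node, $3 }     # indented "state = ..." line
    '
}

# Usage on the server host; skipped when pbsnodes is not installed.
if command -v pbsnodes >/dev/null 2>&1; then
    pbsnodes -a | summarize_nodes
fi
```

The pattern requires spaces around the equals sign, so the state=free entry buried inside the status line is not counted a second time.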
I've configured the example batch queue as described in the manual.
Output of qmgr -c 'p s':
create queue batch
set queue batch queue_type = Execution
set queue batch acl_users = peter
set queue batch acl_users += vlad
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server acl_hosts = gpu
set server acl_hosts += gpu07
set server acl_hosts += gpu06
set server acl_hosts += gpu05
set server acl_hosts += gpu04
set server acl_hosts += gpu03
set server acl_hosts += gpu02
set server acl_hosts += gpu01
set server acl_hosts += 127.0.0.1
set server acl_roots = root@*
set server managers = root@gpu07
set server operators = root@gpu
set server operators += vlad@gpu
set server operators += vlad@gpu07
set server operators += vlad@gpu06
set server operators += vlad@gpu05
set server operators += vlad@gpu04
set server operators += vlad@gpu03
set server operators += vlad@gpu02
set server operators += vlad@gpu01
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 11
Maui is running; showq tells me that the batch queue exists and that the
resources are free and waiting for work.
Every time I submit the test job
echo "sleep 30" | qsub
from host gpu, I get the error "invalid request", and /var/log/messages
states:
"...
Aug 24 21:48:42 gpu pbs_server: LOG_ERROR::Success (0) in req_commit, cannot commit job in unexpected state
Aug 24 22:08:02 gpu pbs_server: LOG_ERROR::Success (0) in req_commit, cannot commit job in unexpected state
..."
And my queue remains empty ..
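For reference, the same test can be written as a script file that names the queue explicitly instead of relying on default_queue. This is only a sketch under assumptions: the file name test_job.sh and the job name sleeptest are made up for illustration, and the queue name batch is taken from the qmgr output above.

```shell
# Write a small test job script. The file name test_job.sh is hypothetical.
cat > test_job.sh <<'EOF'
#!/bin/sh
#PBS -N sleeptest
#PBS -q batch
#PBS -l nodes=1,walltime=00:05:00
sleep 30
EOF

# Submit the job and list the queue; skipped when qsub is not installed.
if command -v qsub >/dev/null 2>&1; then
    qsub test_job.sh
    qstat -a
fi
```

If this form fails in the same way as the piped one-liner, the problem is unlikely to lie in how the job text reaches qsub.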
pinging from the nodes to gpu and vice versa succeeds, momctl -d 3
on the nodes
What could this be, and where should I look?
Any help appreciated ..
Thanks in advance
Greetings from Salzburg/Austria/Europe
Vlad