[torqueusers] Cannot submit Job to default batch queue

Vlad Popa vlad at cosy.sbg.ac.at
Wed Aug 24 12:45:07 MDT 2011


Hi!

We are running Torque 3.0.2 with Maui 3.3-6 as the scheduler on a PUIAS
Linux 6.1 64-bit distribution (equivalent to RHEL 6.x, installed through
distro RPMs).
We have 7 GPU nodes, gpu01-07, and 1 GPU master/portal host (hostname
gpu) from which the jobs should be submitted. So pbs_server is
running on gpu, and pbs_mom on nodes 1-7. The home directory is exported
via NFS from gpu to the nodes, authentication is NIS-based, and
passwordless ssh login works on all nodes.

pbsnodes -a shows all 7 nodes up and in state free.

...
  gpu01
      state = free
      np = 6
      properties = i7
      ntype = cluster
      status = 
rectime=1314217126,varattr=,jobs=,state=free,netload=269864473916,gres=,loadave=2.91,ncpus=8,physmem=16315316kb,availmem=47091716kb,totmem=49083308kb,idletime=615,nusers=1,nsessions=1,sessions=28415,uname=Linux 
gpu01 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011 
x86_64,opsys=linux
      mom_service_port = 15002
      mom_manager_port = 15003
      gpus = 2

gpu02
      state = free
      np = 6
      properties = i7
      ntype = cluster
      status = 
rectime=1314217134,varattr=,jobs=,state=free,netload=330061706733,gres=,loadave=0.00,ncpus=8,physmem=12187556kb,availmem=44198668kb,totmem=44955548kb,idletime=1158667,nusers=0,nsessions=0,uname=Linux 
gpu02 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011 
x86_64,opsys=linux
      mom_service_port = 15002
..
etc etc ..
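
A quick way to confirm the "all nodes free" reading of the listing above is to tally the state lines. This is only a minimal sketch: count_node_states is a hypothetical helper name, not a Torque command, and it assumes the attribute lines look like "state = free" as in the output shown.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque): tally node states from
# `pbsnodes -a` output read on stdin.
count_node_states() {
    # Match only the "state = ..." attribute lines. The long status
    # line also contains "state=free", but without surrounding spaces,
    # so it never splits into the fields tested here.
    awk '$1 == "state" && $2 == "=" { n[$3]++ }
         END { for (s in n) print s, n[s] }' | sort
}
```

On the server this would be run as: pbsnodes -a | count_node_states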

I've configured the example batch queue as described in the manual.

qmgr -c 'p s' :

create queue batch
set queue batch queue_type = Execution
set queue batch acl_users = peter
set queue batch acl_users += vlad
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server acl_hosts = gpu
set server acl_hosts += gpu07
set server acl_hosts += gpu06
set server acl_hosts += gpu05
set server acl_hosts += gpu04
set server acl_hosts += gpu03
set server acl_hosts += gpu02
set server acl_hosts += gpu01
set server acl_hosts += 127.0.0.1
set server acl_roots = root@*
set server managers = root@gpu07
set server operators = root@gpu
set server operators += vlad@gpu
set server operators += vlad@gpu07
set server operators += vlad@gpu06
set server operators += vlad@gpu05
set server operators += vlad@gpu04
set server operators += vlad@gpu03
set server operators += vlad@gpu02
set server operators += vlad@gpu01
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 11
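
To cross-check which users the queue ACL in the dump above names, the acl_users lines can be pulled out of a saved copy. This is a sketch under assumptions: queue_acl_users is a hypothetical helper name, and it only reads the text format shown above; it does not say whether the ACL is actually being enforced.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque): list the acl_users entries
# for one queue from `qmgr -c 'p s'` output read on stdin.
queue_acl_users() {
    # $1 = queue name, e.g. "batch"
    awk -v q="$1" '$1 == "set" && $2 == "queue" && $3 == q &&
                   $4 == "acl_users" && ($5 == "=" || $5 == "+=") { print $6 }'
}
```

For the configuration shown here it would be run as: qmgr -c 'p s' | queue_acl_users batch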

Maui is running, and showq shows the batch queue, with the resources
free and waiting for work.

Every time I submit the test job  echo "sleep 30" | qsub  from host gpu,
I get the error "invalid request", and /var/log/messages states:

"...
Aug 24 21:48:42 gpu pbs_server: LOG_ERROR::Success (0) in req_commit, 
cannot commit job in unexpected state
Aug 24 22:08:02 gpu pbs_server: LOG_ERROR::Success (0) in req_commit, 
cannot commit job in unexpected state
..."
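
To see whether each failed submission produces one of these entries, the matching lines can be counted before and after a test qsub. This is a minimal sketch: count_req_commit_errors is a hypothetical helper name, and the pattern assumes the syslog lines are unwrapped and have the form quoted above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque): count pbs_server
# req_commit errors in syslog text read on stdin.
count_req_commit_errors() {
    # "c + 0" makes awk print 0 instead of an empty line when
    # nothing matched.
    awk '/pbs_server: LOG_ERROR::.* in req_commit/ { c++ }
         END { print c + 0 }'
}
```

On the server this would be run as: count_req_commit_errors < /var/log/messages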


And my queue remains empty.

  pinging from the nodes  to gpu and vice versa succeeds,  momctl -d 3 
on the nodes

What could this be, and where should I look?

Any help appreciated ..
Thanks in advance
Greetings from Salzburg/Austria/Europe

Vlad




