[torqueusers] Job not running

Jeff Layton laytonjb at att.net
Sun Aug 5 13:59:13 MDT 2012


  Good afternoon,

I apologize for the eternal question, "why isn't my job running?",
but I'm not sure where to look next. I'm running Torque 4.0.2,
which I built on a Scientific Linux 6.2 box.

The job script is,

#!/bin/bash
#PBS -q batch
#PBS -l walltime=00:10:00
#PBS -l nodes=1:ppn=1

date
hostname
sleep 20
date


I submit it with qsub, and then "qstat -a" shows:

[laytonjb at test1 TEST]$ qstat -a

test1:
                                                                               Req'd  Req'd   Elap
Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory Time  S Time
-------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
11.test1             laytonjb    batch    pbs_test2           --      1      1    --  00:10 Q   --


It stays like this forever. I looked in the logs and didn't see
anything obvious. Here is some output that may help.


Server logs:

08/05/2012 16:15:35;0100;PBS_Server;Job;11.test1;enqueuing into batch, state 1 hop 1
08/05/2012 16:15:35;0008;PBS_Server;Job;11.test1;Job Queued at request of laytonjb at test1, owner = laytonjb at test1, job name = pbs_test2, queue = batch
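
For anyone reading these records, here is a minimal sketch of splitting one log line into its semicolon-separated fields. The function name parse_torque_log and the field labels (time, code, daemon, class, object, message) are my own assumptions for illustration, not official Torque names; it only relies on the layout visible in the lines above.

```shell
#!/bin/sh
# Hypothetical helper: split one Torque log record on ';'.
# The labels printed below are illustrative, not Torque terminology.
# The body runs in a subshell, so the variables do not leak out.
parse_torque_log() (
    # The message is the last field, so any extra ';' stays inside it.
    IFS=';' read -r stamp code daemon class object message <<EOF
$1
EOF
    # Drop the space padding that appears before "pbs_mom" and "pbs_sched".
    daemon=${daemon#"${daemon%%[! ]*}"}
    printf 'time=%s\ncode=%s\ndaemon=%s\nclass=%s\nobject=%s\nmessage=%s\n' \
        "$stamp" "$code" "$daemon" "$class" "$object" "$message"
)

# Example with the first server record from this post.
parse_torque_log '08/05/2012 16:15:35;0100;PBS_Server;Job;11.test1;enqueuing into batch, state 1 hop 1'
```

Having the timestamp as its own field makes it easier to line up what the server, scheduler, and mom each logged around the time the job was queued.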


Scheduler logs: (FIFO scheduler):

08/05/2012 15:44:44;0002; pbs_sched;Svr;die;caught signal 15
08/05/2012 15:44:44;0002; pbs_sched;Svr;Log;Log closed
08/05/2012 15:44:44;0002; pbs_sched;Svr;Log;Log opened
08/05/2012 15:44:44;0002; pbs_sched;Svr;TokenAct;Account file /opt/torque/sched_priv/accounting/20120805 opened
08/05/2012 15:44:44;0002; pbs_sched;Svr;main;/opt/torque/sbin/pbs_sched startup pid 4782


pbs_mom logs: (I tried restarting the mom ("service pbs_mom restart") 
and the output is below)

08/05/2012 16:17:28;0002;   pbs_mom;n/a;rm_request;shutdown
08/05/2012 16:17:28;0002;   pbs_mom;n/a;dep_cleanup;dependent cleanup
08/05/2012 16:17:28;0002;   pbs_mom;Svr;Log;Log closed
08/05/2012 16:17:31;0002;   pbs_mom;Svr;Log;Log opened
08/05/2012 16:17:31;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 4.0.2, loglevel = 0
08/05/2012 16:17:31;0002;   pbs_mom;Svr;setpbsserver;test1
08/05/2012 16:17:31;0002;   pbs_mom;Svr;mom_server_add;server test1 added
08/05/2012 16:17:31;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in check_partition_confirm_script, Couldn't stat the partition confirm command '/opt/moab/default/tools/xt4/partition.create.xt4.pl' - ignore this if you aren't running a cray
08/05/2012 16:17:31;0002;   pbs_mom;n/a;initialize;independent
08/05/2012 16:17:31;0080;   pbs_mom;Svr;pbs_mom;before init_abort_jobs
08/05/2012 16:17:31;0002;   pbs_mom;Svr;pbs_mom;Is up
08/05/2012 16:17:31;0002;   pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/sbin/pbs_mom 1344179259
08/05/2012 16:17:31;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 4.0.2, loglevel = 0


pbsnodes -a:

[root at test1 mom_logs]# pbsnodes -a
n0001
      state = free
      np = 1
      ntype = cluster
      status = rectime=1344197869,varattr=,jobs=,state=free,netload=120587595,gres=,loadave=0.02,ncpus=3,physmem=2956668kb,availmem=2836956kb,totmem=2956668kb,idletime=4196,nusers=1,nsessions=1,sessions=1560,uname=Linux n0001 2.6.32-220.el6.x86_64 #1 SMP Sat Dec 10 17:04:11 CST 2011 x86_64,opsys=linux
      mom_service_port = 15002
      mom_manager_port = 15003
      gpus = 0
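
The status attribute in the pbsnodes output above is one long comma-separated list of key=value pairs, which is hard to read by eye. A minimal sketch for pulling a single value out of it follows. The helper name status_field is hypothetical (it is not a Torque command), and the sample string is a shortened copy of the status value shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one key from a pbsnodes status string.
# Usage: status_field KEY "STATUS_STRING"
status_field() (
    key=$1
    status=$2
    # Put each key=value pair on its own line, then print whatever
    # follows the first '=' on the line whose key matches.
    printf '%s\n' "$status" | tr ',' '\n' | while IFS='=' read -r k v; do
        if [ "$k" = "$key" ]; then
            printf '%s\n' "$v"
        fi
    done
)

# Shortened copy of the status value reported for n0001.
status='rectime=1344197869,varattr=,jobs=,state=free,loadave=0.02,ncpus=3,physmem=2956668kb'
status_field ncpus "$status"
status_field state "$status"
```

With the sample string this prints 3 and then free. That makes it easy to compare what the mom reports (ncpus=3) with what the server has configured for the node (np = 1).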



qmgr -c "p s":
[root at test1 mom_logs]# qmgr -c "p s"
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = test1
set server managers = laytonjb at test1
set server operators = laytonjb at test1
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 45
set server poll_jobs = True
set server mom_job_sync = True
set server next_job_number = 12
set server moab_array_compatible = True


Not sure where to start looking from here.

TIA!

Jeff
