[torqueusers] Job not running
Jeff Layton
laytonjb at att.net
Sun Aug 5 13:59:13 MDT 2012
Good afternoon,
I apologize for the eternal question, "why isn't my job running"
but I'm not sure where to look next. I'm running Torque 4.0.2
that I built on a Scientific Linux 6.2 box.
The job script is,
#!/bin/bash
#PBS -q batch
#PBS -l walltime=00:10:00
#PBS -l nodes=1:ppn=1
date
hostname
sleep 20
date
I submit using qsub and then "qstat -a" looks like,
[laytonjb at test1 TEST]$ qstat -a
test1:
                                                                               Req'd  Req'd   Elap
Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory Time  S Time
-------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
11.test1             laytonjb    batch    pbs_test2            --     1      1     -- 00:10 Q    --
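To watch whether the job ever leaves the queued state, the S column can be pulled out of the "qstat -a" listing. This is only a minimal sketch: job_state is a hypothetical helper name, not part of Torque, and it assumes the usual 11-field "qstat -a" row layout where the state is the 10th whitespace-separated field.

```shell
#!/bin/bash
# Hypothetical helper (not a Torque command): read "qstat -a" output on
# stdin and print the state column (S) for the given job ID.
# Assumes the row layout: Job ID, Username, Queue, Jobname, SessID,
# NDS, TSK, Memory, Time, S, Time (state is field 10).
job_state() {
    local jobid="$1"
    awk -v id="$jobid" '$1 == id { print $10 }'
}
```

It would be used as: qstat -a | job_state 11.test1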
It stays like this forever. I looked in the logs and didn't see
anything obvious. Here is some output that may help.
Server logs:
08/05/2012 16:15:35;0100;PBS_Server;Job;11.test1;enqueuing into batch, state 1 hop 1
08/05/2012 16:15:35;0008;PBS_Server;Job;11.test1;Job Queued at request of laytonjb at test1, owner = laytonjb at test1, job name = pbs_test2, queue = batch
Scheduler logs (FIFO scheduler):
08/05/2012 15:44:44;0002; pbs_sched;Svr;die;caught signal 15
08/05/2012 15:44:44;0002; pbs_sched;Svr;Log;Log closed
08/05/2012 15:44:44;0002; pbs_sched;Svr;Log;Log opened
08/05/2012 15:44:44;0002; pbs_sched;Svr;TokenAct;Account file /opt/torque/sched_priv/accounting/20120805 opened
08/05/2012 15:44:44;0002; pbs_sched;Svr;main;/opt/torque/sbin/pbs_sched startup pid 4782
pbs_mom logs: (I tried restarting the mom ("service pbs_mom restart")
and the output is below)
08/05/2012 16:17:28;0002; pbs_mom;n/a;rm_request;shutdown
08/05/2012 16:17:28;0002; pbs_mom;n/a;dep_cleanup;dependent cleanup
08/05/2012 16:17:28;0002; pbs_mom;Svr;Log;Log closed
08/05/2012 16:17:31;0002; pbs_mom;Svr;Log;Log opened
08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 4.0.2, loglevel = 0
08/05/2012 16:17:31;0002; pbs_mom;Svr;setpbsserver;test1
08/05/2012 16:17:31;0002; pbs_mom;Svr;mom_server_add;server test1 added
08/05/2012 16:17:31;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in check_partition_confirm_script, Couldn't stat the partition confirm command '/opt/moab/default/tools/xt4/partition.create.xt4.pl' - ignore this if you aren't running a cray
08/05/2012 16:17:31;0002; pbs_mom;n/a;initialize;independent
08/05/2012 16:17:31;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Is up
08/05/2012 16:17:31;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/sbin/pbs_mom 1344179259
08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 4.0.2, loglevel = 0
pbsnodes -a:
[root at test1 mom_logs]# pbsnodes -a
n0001
state = free
np = 1
ntype = cluster
status = rectime=1344197869,varattr=,jobs=,state=free,netload=120587595,gres=,loadave=0.02,ncpus=3,physmem=2956668kb,availmem=2836956kb,totmem=2956668kb,idletime=4196,nusers=1,nsessions=1,sessions=1560,uname=Linux n0001 2.6.32-220.el6.x86_64 #1 SMP Sat Dec 10 17:04:11 CST 2011 x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0
qmgr -c "p s":
[root at test1 mom_logs]# qmgr -c "p s"
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = test1
set server managers = laytonjb at test1
set server operators = laytonjb at test1
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 45
set server poll_jobs = True
set server mom_job_sync = True
set server next_job_number = 12
set server moab_array_compatible = True
Not sure where to start looking from here.
TIA!
Jeff