[torqueusers] Job not running
Jeff Layton
laytonjb at att.net
Sun Aug 5 17:27:02 MDT 2012
Just an FYI: the job does run once I force it with qrun. Does this point
to the scheduler? (I'm just using the default scheduler that comes
with Torque, i.e. not Maui.)
Thanks!
Jeff
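
For anyone following along, here is a minimal sketch of how the
scheduler-related settings could be pulled out of the `qmgr -c "p s"`
output quoted below. The function names (server_attr, sched_report) are
hypothetical helpers, not part of Torque; only qmgr and qrun are the
commands already used in this thread. It assumes the output has the
"set server NAME = VALUE" form shown below.

```shell
#!/bin/bash
# Sketch only: server_attr and sched_report are hypothetical helper names.

# Print the value of one server attribute from `qmgr -c "p s"` output on
# stdin. Accepts plain lines as well as mail-quoted ("> ") lines.
server_attr() {
    local name="$1"
    sed -n "s/^[> ]*set server ${name} = \(.*\)\$/\1/p"
}

# Summarize the two server attributes that control the scheduler cycle.
sched_report() {
    local cfg scheduling iteration
    cfg="$(cat)"
    scheduling="$(printf '%s\n' "$cfg" | server_attr scheduling)"
    iteration="$(printf '%s\n' "$cfg" | server_attr scheduler_iteration)"
    echo "scheduling=${scheduling:-unset}"
    echo "scheduler_iteration=${iteration:-unset}"
}

# Usage on a live server (assumes qmgr is in PATH):
#   qmgr -c "p s" | sched_report
```

With the settings quoted below this would report scheduling=True and
scheduler_iteration=600. As far as I understand it, that value is in
seconds, so a scheduling cycle would be expected roughly every ten
minutes, yet the scheduler log below shows nothing after its 15:44:44
startup even though the job was queued at 16:15:35. That gap, plus the
fact that qrun works, is why I suspect the scheduler rather than the
server or the mom.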
> Good afternoon,
>
> I apologize for the eternal question, "why isn't my job running"
> but I'm not sure where to look next. I'm running Torque 4.0.2
> that I built on a Scientific Linux 6.2 box.
>
> The job script is,
>
> #!/bin/bash
> #PBS -q batch
> #PBS -l walltime=00:10:00
> #PBS -l nodes=1:ppn=1
>
> date
> hostname
> sleep 20
> date
>
>
> I submit using qsub and then "qstat -a" looks like,
>
> [laytonjb@test1 TEST]$ qstat -a
>
> test1:
>
>                                                                                Req'd  Req'd   Elap
> Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory Time  S Time
> -------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
> 11.test1             laytonjb    batch    pbs_test2        --     1     1      --     00:10 Q --
>
>
> It stays like this forever. I looked in the logs and didn't see
> anything obvious. Here is some output that may help.
>
>
> Server logs:
>
> 08/05/2012 16:15:35;0100;PBS_Server;Job;11.test1;enqueuing into batch, state 1 hop 1
> 08/05/2012 16:15:35;0008;PBS_Server;Job;11.test1;Job Queued at request of laytonjb@test1, owner = laytonjb@test1, job name = pbs_test2, queue = batch
>
>
> Scheduler logs: (FIFO scheduler):
>
> 08/05/2012 15:44:44;0002; pbs_sched;Svr;die;caught signal 15
> 08/05/2012 15:44:44;0002; pbs_sched;Svr;Log;Log closed
> 08/05/2012 15:44:44;0002; pbs_sched;Svr;Log;Log opened
> 08/05/2012 15:44:44;0002; pbs_sched;Svr;TokenAct;Account file /opt/torque/sched_priv/accounting/20120805 opened
> 08/05/2012 15:44:44;0002; pbs_sched;Svr;main;/opt/torque/sbin/pbs_sched startup pid 4782
>
>
> pbs_mom logs: (I tried restarting the mom ("service pbs_mom restart")
> and the output is below)
>
> 08/05/2012 16:17:28;0002; pbs_mom;n/a;rm_request;shutdown
> 08/05/2012 16:17:28;0002; pbs_mom;n/a;dep_cleanup;dependent cleanup
> 08/05/2012 16:17:28;0002; pbs_mom;Svr;Log;Log closed
> 08/05/2012 16:17:31;0002; pbs_mom;Svr;Log;Log opened
> 08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 4.0.2, loglevel = 0
> 08/05/2012 16:17:31;0002; pbs_mom;Svr;setpbsserver;test1
> 08/05/2012 16:17:31;0002; pbs_mom;Svr;mom_server_add;server test1 added
> 08/05/2012 16:17:31;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in check_partition_confirm_script, Couldn't stat the partition confirm command '/opt/moab/default/tools/xt4/partition.create.xt4.pl' - ignore this if you aren't running a cray
> 08/05/2012 16:17:31;0002; pbs_mom;n/a;initialize;independent
> 08/05/2012 16:17:31;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
> 08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Is up
> 08/05/2012 16:17:31;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/sbin/pbs_mom 1344179259
> 08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 4.0.2, loglevel = 0
>
>
> pbsnodes -a:
>
> [root@test1 mom_logs]# pbsnodes -a
> n0001
>      state = free
>      np = 1
>      ntype = cluster
>      status = rectime=1344197869,varattr=,jobs=,state=free,netload=120587595,gres=,loadave=0.02,ncpus=3,physmem=2956668kb,availmem=2836956kb,totmem=2956668kb,idletime=4196,nusers=1,nsessions=1,sessions=1560,uname=Linux n0001 2.6.32-220.el6.x86_64 #1 SMP Sat Dec 10 17:04:11 CST 2011 x86_64,opsys=linux
>      mom_service_port = 15002
>      mom_manager_port = 15003
>      gpus = 0
>
>
>
> qmgr -c "p s":
> [root@test1 mom_logs]# qmgr -c "p s"
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue batch
> #
> create queue batch
> set queue batch queue_type = Execution
> set queue batch resources_default.nodes = 1
> set queue batch resources_default.walltime = 01:00:00
> set queue batch enabled = True
> set queue batch started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_hosts = test1
> set server managers = laytonjb@test1
> set server operators = laytonjb@test1
> set server default_queue = batch
> set server log_events = 511
> set server mail_from = adm
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 300
> set server job_stat_rate = 45
> set server poll_jobs = True
> set server mom_job_sync = True
> set server next_job_number = 12
> set server moab_array_compatible = True
>
>
> Not sure where to start looking from here.
>
> TIA!
>
> Jeff
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers