[torqueusers] Job not running

Jeff Layton laytonjb at att.net
Sun Aug 5 17:27:02 MDT 2012


  Just an FYI - the job would run once I used qrun. Does this point
to the scheduler? (I'm just using the default scheduler that comes
with Torque, i.e. not Maui.)
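
As far as I understand it, qrun asks pbs_server to start the job directly, so a job that runs under qrun but otherwise sits in Q makes the scheduler (pbs_sched) a reasonable next suspect. A minimal sketch of the checks, assuming the Torque client tools are on PATH and pbs_sched runs on the same host as pbs_server; check_sched is a hypothetical helper name, not a Torque command:

```shell
#!/bin/bash
# Sketch only: check_sched is a hypothetical helper, not part of Torque.

check_sched() {
    # Is a pbs_sched process alive on this host?
    if pgrep -x pbs_sched >/dev/null 2>&1; then
        echo "pbs_sched: running"
    else
        echo "pbs_sched: not running"
    fi

    # Show the scheduler-related server attributes, if qmgr is available.
    if command -v qmgr >/dev/null 2>&1; then
        qmgr -c "print server" | grep -E 'scheduling|scheduler_iteration'
    else
        echo "qmgr: not found on PATH"
    fi
    return 0
}

check_sched
```

If pbs_sched turns out not to be running, starting it and resubmitting the job would be the obvious next experiment.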

Thanks!

Jeff

> Good afternoon,
>
> I apologize for the eternal question, "why isn't my job running"
> but I'm not sure where to look next. I'm running Torque 4.0.2
> that I built on a Scientific Linux 6.2 box.
>
> The job script is,
>
> #!/bin/bash
> #PBS -q batch
> #PBS -l walltime=00:10:00
> #PBS -l nodes=1:ppn=1
>
> date
> hostname
> sleep 20
> date
>
>
> I submit using qsub and then "qstat -a" looks like,
>
> [laytonjb@test1 TEST]$ qstat -a
>
> test1:
>                                                                                Req'd  Req'd   Elap
> Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory Time  S Time
> -------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
> 11.test1             laytonjb    batch    pbs_test2           --      1      1    --  00:10 Q   --
>
>
> It stays like this forever. I looked in the logs and didn't see
> anything obvious. Here is some output that may help.
>
>
> Server logs:
>
> 08/05/2012 16:15:35;0100;PBS_Server;Job;11.test1;enqueuing into batch, state 1 hop 1
> 08/05/2012 16:15:35;0008;PBS_Server;Job;11.test1;Job Queued at request of laytonjb@test1, owner = laytonjb@test1, job name = pbs_test2, queue = batch
>
>
> Scheduler logs: (FIFO scheduler):
>
> 08/05/2012 15:44:44;0002; pbs_sched;Svr;die;caught signal 15
> 08/05/2012 15:44:44;0002; pbs_sched;Svr;Log;Log closed
> 08/05/2012 15:44:44;0002; pbs_sched;Svr;Log;Log opened
> 08/05/2012 15:44:44;0002; pbs_sched;Svr;TokenAct;Account file /opt/torque/sched_priv/accounting/20120805 opened
> 08/05/2012 15:44:44;0002; pbs_sched;Svr;main;/opt/torque/sbin/pbs_sched startup pid 4782
>
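
One thing the excerpts above allow is a timestamp comparison: the last scheduler record shown is the 15:44:44 startup, while the server queued the job at 16:15:35. A minimal sketch of that comparison in bash, using two records copied from the quoted logs with the mail line-wrapping undone. sort_key is a hypothetical helper name, and the field layout (timestamp first, fields separated by ';') is inferred from the excerpts rather than taken from Torque documentation:

```shell
#!/bin/bash
# Records copied from the quoted server and scheduler logs (line wraps undone).
server_line='08/05/2012 16:15:35;0100;PBS_Server;Job;11.test1;enqueuing into batch, state 1 hop 1'
sched_line='08/05/2012 15:44:44;0002; pbs_sched;Svr;main;/opt/torque/sbin/pbs_sched startup pid 4782'

# Hypothetical helper: turn the leading "MM/DD/YYYY HH:MM:SS" field
# into a sortable "YYYYMMDD HH:MM:SS" key.
sort_key() {
    printf '%s\n' "$1" |
        awk -F';' '{ split($1, p, "[/ ]"); print p[3] p[1] p[2] " " p[4] }'
}

queued=$(sort_key "$server_line")
sched=$(sort_key "$sched_line")

# Plain string comparison is enough once the keys are in sortable form.
if [[ "$sched" < "$queued" ]]; then
    echo "last scheduler record ($sched) predates job queue time ($queued)"
else
    echo "scheduler logged activity at or after job queue time"
fi
```

If the full scheduler log likewise shows nothing after the job was queued, that would be consistent with pbs_sched never running a cycle for it, though the excerpt alone does not prove this.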
>
> pbs_mom logs: (I tried restarting the mom ("service pbs_mom restart") 
> and the output is below)
>
> 08/05/2012 16:17:28;0002;   pbs_mom;n/a;rm_request;shutdown
> 08/05/2012 16:17:28;0002;   pbs_mom;n/a;dep_cleanup;dependent cleanup
> 08/05/2012 16:17:28;0002;   pbs_mom;Svr;Log;Log closed
> 08/05/2012 16:17:31;0002;   pbs_mom;Svr;Log;Log opened
> 08/05/2012 16:17:31;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 4.0.2, loglevel = 0
> 08/05/2012 16:17:31;0002;   pbs_mom;Svr;setpbsserver;test1
> 08/05/2012 16:17:31;0002;   pbs_mom;Svr;mom_server_add;server test1 added
> 08/05/2012 16:17:31;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in check_partition_confirm_script, Couldn't stat the partition confirm command '/opt/moab/default/tools/xt4/partition.create.xt4.pl' - ignore this if you aren't running a cray
> 08/05/2012 16:17:31;0002;   pbs_mom;n/a;initialize;independent
> 08/05/2012 16:17:31;0080;   pbs_mom;Svr;pbs_mom;before init_abort_jobs
> 08/05/2012 16:17:31;0002;   pbs_mom;Svr;pbs_mom;Is up
> 08/05/2012 16:17:31;0002;   pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/sbin/pbs_mom 1344179259
> 08/05/2012 16:17:31;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 4.0.2, loglevel = 0
>
>
> pbsnodes -a:
>
> [root@test1 mom_logs]# pbsnodes -a
> n0001
>      state = free
>      np = 1
>      ntype = cluster
>      status = rectime=1344197869,varattr=,jobs=,state=free,netload=120587595,gres=,loadave=0.02,ncpus=3,physmem=2956668kb,availmem=2836956kb,totmem=2956668kb,idletime=4196,nusers=1,nsessions=1,sessions=1560,uname=Linux n0001 2.6.32-220.el6.x86_64 #1 SMP Sat Dec 10 17:04:11 CST 2011 x86_64,opsys=linux
>      mom_service_port = 15002
>      mom_manager_port = 15003
>      gpus = 0
>
>
>
> qmgr -c "p s":
> [root@test1 mom_logs]# qmgr -c "p s"
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue batch
> #
> create queue batch
> set queue batch queue_type = Execution
> set queue batch resources_default.nodes = 1
> set queue batch resources_default.walltime = 01:00:00
> set queue batch enabled = True
> set queue batch started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_hosts = test1
> set server managers = laytonjb@test1
> set server operators = laytonjb@test1
> set server default_queue = batch
> set server log_events = 511
> set server mail_from = adm
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 300
> set server job_stat_rate = 45
> set server poll_jobs = True
> set server mom_job_sync = True
> set server next_job_number = 12
> set server moab_array_compatible = True
>
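
Two server attributes in the dump above bear on the scheduler: scheduling, and scheduler_iteration, which I believe is expressed in seconds (so 600 would be ten minutes between forced scheduling cycles). A small sketch for pulling just those two out of the qmgr output; sched_settings is a hypothetical helper name, and the sample input is copied from the quoted configuration:

```shell
#!/bin/bash
# Hypothetical helper: read `qmgr -c "p s"` style output on stdin and
# print only the scheduler-related server attributes.
sched_settings() {
    awk '$1 == "set" && $2 == "server" &&
         ($3 == "scheduling" || $3 == "scheduler_iteration") {
             print $3 " = " $5
         }'
}

# Sample lines taken from the configuration quoted above.
sched_settings <<'EOF'
set queue batch started = True
set server scheduling = True
set server default_queue = batch
set server scheduler_iteration = 600
EOF
```

On a live system the same filter could be fed directly, e.g. qmgr -c "p s" | sched_settings.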
>
> Not sure where to start looking from here.
>
> TIA!
>
> Jeff
>
>
