[torqueusers] Job not running

André Gemünd andre.gemuend at scai.fraunhofer.de
Mon Aug 6 00:27:43 MDT 2012


Hi Jeff,

Yes, the fact that it runs with qrun points to the scheduler.
Did the pbs_sched daemon perhaps die? Does it work if you submit with default resources (specifically, without ppn)? If not, could you paste the output of tracejob and qstat -f for the job, and raise the log_level a bit? I don't see a problem in the logs you posted.
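
As a rough sketch, the checks above could be run in one go like this. The job ID 11.test1 is taken from your qstat output; the log_level value of 7 and the run_if_present helper are only assumptions for illustration, so adjust them to your setup.

```shell
#!/bin/bash
# Sketch of the suggested checks; not a definitive procedure.
# JOBID defaults to the job from the quoted qstat output.
JOBID="${1:-11.test1}"

# Hypothetical helper: run a command only if it is installed,
# otherwise report that it was skipped.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "skip: $1 not found"
    fi
}

# 1. Is the scheduler daemon still alive?
if pgrep -x pbs_sched >/dev/null 2>&1; then
    echo "pbs_sched is running"
else
    echo "pbs_sched is NOT running"
fi

# 2. Full job attributes, and the job's history collected from the logs.
#    tracejob is given the numeric part of the job ID (11 for 11.test1).
run_if_present qstat -f "$JOBID"
run_if_present tracejob "${JOBID%%.*}"

# 3. Raise the server log verbosity (7 is an assumed value; needs manager rights).
run_if_present qmgr -c "set server log_level = 7"

# 4. To test without ppn, resubmit with the "#PBS -l nodes=1:ppn=1" line
#    changed to "#PBS -l nodes=1" or removed from the job script.
```

On a machine where a tool is missing, the script only prints a "skip" line for it, so it should be safe to try on the submit host first.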

Greetings
Andre

----- Original Message -----
> 
> Just an FYI - the job would run once I used qrun. Does this point
> to the scheduler? (I'm just using the default scheduler that comes
> with Torque, i.e. not Maui.)
> 
> Thanks!
> 
> Jeff
> 
> 
> 
> Good afternoon,
> 
> I apologize for the eternal question, "why isn't my job running"
> but I'm not sure where to look next. I'm running Torque 4.0.2
> that I built on a Scientific Linux 6.2 box.
> 
> The job script is,
> 
> #!/bin/bash
> #PBS -q batch
> #PBS -l walltime=00:10:00
> #PBS -l nodes=1:ppn=1
> 
> date
> hostname
> sleep 20
> date
> 
> 
> I submit using qsub and then "qstat -a" looks like,
> 
> [laytonjb at test1 TEST]$ qstat -a
> 
> test1:
>                                                                          Req'd  Req'd   Elap
> Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory Time  S Time
> -------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
> 11.test1             laytonjb    batch    pbs_test2           --      1      1    --  00:10 Q   --
> 
> 
> It stays like this forever. I looked in the logs and didn't see
> anything obvious. Here is some output that may help.
> 
> 
> Server logs:
> 
> 08/05/2012 16:15:35;0100;PBS_Server;Job;11.test1;enqueuing into
> batch, state 1 hop 1
> 08/05/2012 16:15:35;0008;PBS_Server;Job;11.test1;Job Queued at
> request of laytonjb at test1, owner = laytonjb at test1, job name =
> pbs_test2, queue = batch
> 
> 
> Scheduler logs: (FIFO scheduler):
> 
> 08/05/2012 15:44:44;0002; pbs_sched;Svr;die;caught signal 15
> 08/05/2012 15:44:44;0002; pbs_sched;Svr;Log;Log closed
> 08/05/2012 15:44:44;0002; pbs_sched;Svr;Log;Log opened
> 08/05/2012 15:44:44;0002; pbs_sched;Svr;TokenAct;Account file
> /opt/torque/sched_priv/accounting/20120805 opened
> 08/05/2012 15:44:44;0002; pbs_sched;Svr;main;/opt/torque/sbin/pbs_sched startup pid 4782
> 
> 
> pbs_mom logs: (I tried restarting the mom ("service pbs_mom restart")
> and the output is below)
> 
> 08/05/2012 16:17:28;0002; pbs_mom;n/a;rm_request;shutdown
> 08/05/2012 16:17:28;0002; pbs_mom;n/a;dep_cleanup;dependent cleanup
> 08/05/2012 16:17:28;0002; pbs_mom;Svr;Log;Log closed
> 08/05/2012 16:17:31;0002; pbs_mom;Svr;Log;Log opened
> 08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version =
> 4.0.2, loglevel = 0
> 08/05/2012 16:17:31;0002; pbs_mom;Svr;setpbsserver;test1
> 08/05/2012 16:17:31;0002; pbs_mom;Svr;mom_server_add;server test1
> added
> 08/05/2012 16:17:31;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::No such file
> or directory (2) in check_partition_confirm_script, Couldn't stat
> the partition confirm command '/opt/moab/default/tools/xt4/
> partition.create.xt4.pl ' - ignore this if you aren't running a cray
> 08/05/2012 16:17:31;0002; pbs_mom;n/a;initialize;independent
> 08/05/2012 16:17:31;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
> 08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Is up
> 08/05/2012 16:17:31;0002; pbs_mom;Svr;setup_program_environment;MOM
> executable path and mtime at launch: /usr/sbin/pbs_mom 1344179259
> 08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version =
> 4.0.2, loglevel = 0
> 
> 
> pbsnodes -a:
> 
> [root at test1 mom_logs]# pbsnodes -a
> n0001
> state = free
> np = 1
> ntype = cluster
> status =
> rectime=1344197869,varattr=,jobs=,state=free,netload=120587595,gres=,loadave=0.02,ncpus=3,physmem=2956668kb,availmem=2836956kb,totmem=2956668kb,idletime=4196,nusers=1,nsessions=1,sessions=1560,uname=Linux
> n0001 2.6.32-220.el6.x86_64 #1 SMP Sat Dec 10 17:04:11 CST 2011
> x86_64,opsys=linux
> mom_service_port = 15002
> mom_manager_port = 15003
> gpus = 0
> 
> 
> 
> qmgr -c "p s":
> [root at test1 mom_logs]# qmgr -c "p s"
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue batch
> #
> create queue batch
> set queue batch queue_type = Execution
> set queue batch resources_default.nodes = 1
> set queue batch resources_default.walltime = 01:00:00
> set queue batch enabled = True
> set queue batch started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_hosts = test1
> set server managers = laytonjb at test1
> set server operators = laytonjb at test1
> set server default_queue = batch
> set server log_events = 511
> set server mail_from = adm
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 300
> set server job_stat_rate = 45
> set server poll_jobs = True
> set server mom_job_sync = True
> set server next_job_number = 12
> set server moab_array_compatible = True
> 
> 
> Not sure where to start looking from here.
> 
> TIA!
> 
> Jeff
> 
> _______________________________________________
> torqueusers mailing list torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

-- 
André Gemünd
Fraunhofer-Institute for Algorithms and Scientific Computing
andre.gemuend at scai.fraunhofer.de
Tel: +49 2241 14-2193
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend

