[torqueusers] Job not running

André Gemünd andre.gemuend@scai.fraunhofer.de
Tue Aug 7 01:49:56 MDT 2012


Hi Jeff,

please do a 

qmgr -c 'set server log_level = 7'

and try again. Perhaps we can get some more information about the problem then.
And please, send a qstat -f, not -a. :)
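
If it helps, the full sequence would be something like this (the
paths assume your /opt/torque prefix, and the script name is just a
placeholder):

qmgr -c 'set server log_level = 7'
qsub pbs_test2.sh
qstat -f <jobid>
tail -f /opt/torque/server_logs/$(date +%Y%m%d)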

Greetings
André

----- Original Message -----
> Gus.
> 
> Thanks for the email! Everything is run by root and was installed
> by root. I tried your suggestion below to add root to the server
> managers and operators, but that didn't change anything. The jobs
> still hang and I can't figure out why.
> 
> I'm still trying some things, but no joy so far. I think the
> problem is in the scheduler, but I can't seem to locate it. It's
> the simple FIFO scheduler that ships with Torque, so I don't see
> any reason why it would hold jobs. The only thing I can think of
> is that it doesn't think there are any resources available, but I
> can't find a reason why it would think that.
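>
> The only sanity checks I could think of were along these lines,
> and both looked normal here:
>
> pgrep -l pbs_sched
> qmgr -c 'p s' | grep scheduling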
> 
> Thanks!
> 
> Jeff
> 
> 
> > Hi Jeff
> >
> > Are you running pbs_server, pbs_mom, pbs_sched as yourself or root?
> >
> > Would this work?
> >
> > qmgr -c 'set server managers += root@test1'
> > qmgr -c 'set server operators += root@test1'
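> >
> > (To verify the change took effect, something like
> > qmgr -c 'p s' | grep -E 'managers|operators'
> > should show the new entries.)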
> >
> > From the Torque Admin Guide:
> >
> > http://www.adaptivecomputing.com/resources/docs/torque/4-0-1/help.htm#topics/1-installConfig/installing.htm
> >
> > "Adaptive Computing recommends that the TORQUE administrator be
> > root.
> > For information about customizing the build at configure time, see
> > Customizing the install."
> >
> > "TORQUE must be installed by a root user. If running sudo fails,
> > switch
> > to root with su-."
> >
> > I am not sure if root must reign over Torque, but the user guide
> > has several other references to commands that need to be run by
> > root, processes that need to be owned by root, etc.
> > A list-subscribing expert could shed some light here.
> >
> > I hope it helps,
> > Gus Correa
> >
> >
> > On 08/06/2012 09:24 AM, Jeff Layton wrote:
> >>     André,
> >>
> >> Thanks! I tried running the job again this morning without
> >> the "np=1" and it still hung. I'm attaching the tracejob output
> >> as well as the scheduler logs. Note the mom logs are actually
> >> named by the compute node name. For example, the first
> >> compute node is called n0001 so the mom logs are called
> >> n0001.log. I do this since all of the mom logs are in one
> >> NFS directory (/opt/torque/mom_logs).
> >>
> >> I don't know how to raise the log level - can you help with
> >> that?
> >>
> >> Thanks!
> >>
> >> Jeff
> >>
> >>
> >> qstat and tracejob output:
> >> [root@test1 bin]# qstat -a
> >>
> >> test1:
> >>                                                                   Req'd  Req'd   Elap
> >> Job ID               Username    Queue    Jobname          SessID  NDS   TSK    Memory Time  S Time
> >> -------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
> >> 13.test1             laytonjb    batch    pbs_test2           --      1      1    --  00:10 Q   --
> >> [root@test1 bin]# ./tracejob 13.test1
> >> /opt/torque/mom_logs/20120806: No such file or directory
> >> /opt/torque/sched_logs/20120806: No matching job records located
> >>
> >> Job: 13.test1
> >>
> >> 08/06/2012 09:32:53  S    enqueuing into batch, state 1 hop 1
> >> 08/06/2012 09:32:53  S    Job Queued at request of laytonjb@test1, owner = laytonjb@test1, job name = pbs_test2, queue = batch
> >> 08/06/2012 09:32:53  A    queue=batch
> >>
> >>
> >>
> >>
> >> Scheduler Logs:
> >> [root@test1 sched_logs]# more 20120806
> >> 08/06/2012 09:25:54;0002; pbs_sched;Svr;Log;Log opened
> >> 08/06/2012 09:25:54;0002; pbs_sched;Svr;TokenAct;Account file /opt/torque/sched_priv/accounting/20120806 opened
> >> 08/06/2012 09:25:54;0002; pbs_sched;Svr;main;/opt/torque/sbin/pbs_sched startup pid 2090
> >>
> >>
> >>
> >>> Hi Jeff,
> >>>
> >>> yes, the fact that it runs with qrun points at the scheduler.
> >>> Did the pbs_sched daemon perhaps die? Does it work if you submit
> >>> with default resources (specifically without ppn)? If not,
> >>> could you paste us a tracejob and a qstat -f, and raise the
> >>> log_level a bit? I don't see a problem in your logs.
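> >>>
> >>> For the default-resources test, something as bare as this would
> >>> do, since qsub also reads the job script from stdin:
> >>>
> >>> echo "sleep 20" | qsub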
> >>>
> >>> Greetings
> >>> Andre
> >>>
> >>> ----- Original Message -----
> >>>> Just an FYI - the job would run once I used qrun. Does this
> >>>> point to the scheduler? (I'm just using the default scheduler
> >>>> that comes with Torque, i.e. not Maui.)
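> >>>>
> >>>> (By "used qrun" I mean forcing the job by hand, i.e. something
> >>>> like "qrun 11.test1".)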
> >>>>
> >>>> Thanks!
> >>>>
> >>>> Jeff
> >>>>
> >>>>
> >>>>
> >>>> Good afternoon,
> >>>>
> >>>> I apologize for the eternal question, "why isn't my job running"
> >>>> but I'm not sure where to look next. I'm running Torque 4.0.2
> >>>> that I built on a Scientific Linux 6.2 box.
> >>>>
> >>>> The job script is,
> >>>>
> >>>> #!/bin/bash
> >>>> #PBS -q batch
> >>>> #PBS -l walltime=00:10:00
> >>>> #PBS -l nodes=1:ppn=1
> >>>>
> >>>> date
> >>>> hostname
> >>>> sleep 20
> >>>> date
> >>>>
> >>>>
> >>>> I submit using qsub and then "qstat -a" looks like,
> >>>>
> >>>> [laytonjb@test1 TEST]$ qstat -a
> >>>>
> >>>> test1:
> >>>>                                                                   Req'd  Req'd   Elap
> >>>> Job ID               Username    Queue    Jobname          SessID  NDS   TSK    Memory Time  S Time
> >>>> -------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
> >>>> 11.test1             laytonjb    batch    pbs_test2           --      1      1    --  00:10 Q   --
> >>>>
> >>>>
> >>>> It stays like this forever. I looked in the logs and didn't
> >>>> see anything obvious. Here is some output that may help.
> >>>>
> >>>>
> >>>> Server logs:
> >>>>
> >>>> 08/05/2012 16:15:35;0100;PBS_Server;Job;11.test1;enqueuing into batch, state 1 hop 1
> >>>> 08/05/2012 16:15:35;0008;PBS_Server;Job;11.test1;Job Queued at request of laytonjb@test1, owner = laytonjb@test1, job name = pbs_test2, queue = batch
> >>>>
> >>>>
> >>>> Scheduler logs: (FIFO scheduler):
> >>>>
> >>>> 08/05/2012 15:44:44;0002; pbs_sched;Svr;die;caught signal 15
> >>>> 08/05/2012 15:44:44;0002; pbs_sched;Svr;Log;Log closed
> >>>> 08/05/2012 15:44:44;0002; pbs_sched;Svr;Log;Log opened
> >>>> 08/05/2012 15:44:44;0002; pbs_sched;Svr;TokenAct;Account file /opt/torque/sched_priv/accounting/20120805 opened
> >>>> 08/05/2012 15:44:44;0002; pbs_sched;Svr;main;/opt/torque/sbin/pbs_sched startup pid 4782
> >>>>
> >>>>
> >>>> pbs_mom logs: (I tried restarting the mom ("service pbs_mom
> >>>> restart") and the output is below)
> >>>>
> >>>> 08/05/2012 16:17:28;0002; pbs_mom;n/a;rm_request;shutdown
> >>>> 08/05/2012 16:17:28;0002; pbs_mom;n/a;dep_cleanup;dependent cleanup
> >>>> 08/05/2012 16:17:28;0002; pbs_mom;Svr;Log;Log closed
> >>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;Log;Log opened
> >>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 4.0.2, loglevel = 0
> >>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;setpbsserver;test1
> >>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;mom_server_add;server test1 added
> >>>> 08/05/2012 16:17:31;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in check_partition_confirm_script, Couldn't stat the partition confirm command '/opt/moab/default/tools/xt4/partition.create.xt4.pl ' - ignore this if you aren't running a cray
> >>>> 08/05/2012 16:17:31;0002; pbs_mom;n/a;initialize;independent
> >>>> 08/05/2012 16:17:31;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
> >>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Is up
> >>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/sbin/pbs_mom 1344179259
> >>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 4.0.2, loglevel = 0
> >>>>
> >>>>
> >>>> pbsnodes -a:
> >>>>
> >>>> [root@test1 mom_logs]# pbsnodes -a
> >>>> n0001
> >>>>      state = free
> >>>>      np = 1
> >>>>      ntype = cluster
> >>>>      status = rectime=1344197869,varattr=,jobs=,state=free,netload=120587595,gres=,loadave=0.02,ncpus=3,physmem=2956668kb,availmem=2836956kb,totmem=2956668kb,idletime=4196,nusers=1,nsessions=1,sessions=1560,uname=Linux n0001 2.6.32-220.el6.x86_64 #1 SMP Sat Dec 10 17:04:11 CST 2011 x86_64,opsys=linux
> >>>>      mom_service_port = 15002
> >>>>      mom_manager_port = 15003
> >>>>      gpus = 0
> >>>>
> >>>>
> >>>>
> >>>> qmgr -c "p s":
> >>>> [root@test1 mom_logs]# qmgr -c "p s"
> >>>> #
> >>>> # Create queues and set their attributes.
> >>>> #
> >>>> #
> >>>> # Create and define queue batch
> >>>> #
> >>>> create queue batch
> >>>> set queue batch queue_type = Execution
> >>>> set queue batch resources_default.nodes = 1
> >>>> set queue batch resources_default.walltime = 01:00:00
> >>>> set queue batch enabled = True
> >>>> set queue batch started = True
> >>>> #
> >>>> # Set server attributes.
> >>>> #
> >>>> set server scheduling = True
> >>>> set server acl_hosts = test1
> >>>> set server managers = laytonjb@test1
> >>>> set server operators = laytonjb@test1
> >>>> set server default_queue = batch
> >>>> set server log_events = 511
> >>>> set server mail_from = adm
> >>>> set server scheduler_iteration = 600
> >>>> set server node_check_rate = 150
> >>>> set server tcp_timeout = 300
> >>>> set server job_stat_rate = 45
> >>>> set server poll_jobs = True
> >>>> set server mom_job_sync = True
> >>>> set server next_job_number = 12
> >>>> set server moab_array_compatible = True
> >>>>
> >>>>
> >>>> Not sure where to start looking from here.
> >>>>
> >>>> TIA!
> >>>>
> >>>> Jeff
> >>>>

-- 
André Gemünd
Fraunhofer-Institute for Algorithms and Scientific Computing
andre.gemuend@scai.fraunhofer.de
Tel: +49 2241 14-2193
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend

