[torqueusers] Job not running

Jeff Layton laytonjb at att.net
Mon Aug 6 11:32:55 MDT 2012


Gus,

Thanks for the email! Everything is run by root and was installed
by root. I tried your suggestions below to add root to the server
managers and operators lists, but that didn't change anything. The jobs
still hang and I can't figure out why.

I'm still trying some things, but no joy so far. I think the problem is
in the scheduler, but I can't pin it down. It's the simple FIFO scheduler
that ships with Torque, so I don't see any reason why it would hold jobs.
The only thing I can think of is that it believes there are no resources
available, but I can't find a reason for that either.
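
In case it helps anyone spot something, here is roughly what I have been
checking so far (n0001 and 13.test1 are just the node and job from the
output further down):

# does the server see the node as free?
pbsnodes -a

# does the job record carry a comment explaining the hold?
qstat -f 13.test1

# can the server reach the mom, and what resources does it report?
momctl -d 3 -h n0001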

Thanks!

Jeff
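
P.S. I think I found the knobs for raising the log level in the meantime,
though I haven't verified these are exactly right, so corrections welcome:

# server side (7 should be the most verbose level)
qmgr -c 'set server log_level = 7'

# mom side: add this line to /opt/torque/mom_priv/config on the node,
# then restart pbs_mom
$loglevel 7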


> Hi Jeff
>
> Are you running pbs_server, pbs_mom, pbs_sched as yourself or root?
>
> Would this work?
>
> qmgr -c 'set server managers += root@test1'
> qmgr -c 'set server operators += root@test1'
>
>   From the Torque Admin Guide:
>
> http://www.adaptivecomputing.com/resources/docs/torque/4-0-1/help.htm#topics/1-installConfig/installing.htm
>
> "Adaptive Computing recommends that the TORQUE administrator be root.
> For information about customizing the build at configure time, see
> Customizing the install."
>
> "TORQUE must be installed by a root user. If running sudo fails, switch
> to root with su-."
>
> I am not sure whether root must reign over Torque, but the user guide
> has several other references to commands that need to be run by root,
> processes that need to be owned by root, etc. An expert subscribed to
> this list could shed some light here.
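>
> In the meantime, you can check what the server currently has set with
> something like this (the grep is just to trim the output):
>
> qmgr -c 'print server' | grep -E 'managers|operators'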
>
> I hope it helps,
> Gus Correa
>
>
> On 08/06/2012 09:24 AM, Jeff Layton wrote:
>>     André,
>>
>> Thanks! I tried running the job again this morning without
>> the "np=1" and it still hung. I'm attaching the tracejob output
>> as well as the scheduler logs. Note the mom logs are actually
>> named by the compute node name. For example, the first
>> compute node is called n0001 so the mom logs are called
>> n0001.log. I do this since all of the mom logs are in one
>> NFS directory (/opt/torque/mom_logs).
>>
>> I don't know how to raise the log level - can you help with
>> that?
>>
>> Thanks!
>>
>> Jeff
>>
>>
>> qstat -a and tracejob output:
>> [root@test1 bin]# qstat -a
>>
>> test1:
>>                                                                             Req'd  Req'd   Elap
>> Job ID               Username    Queue    Jobname          SessID NDS TSK    Memory Time  S Time
>> -------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
>> 13.test1             laytonjb    batch    pbs_test2           --  1      1    --  00:10 Q   --
>> [root@test1 bin]# ./tracejob 13.test1
>> /opt/torque/mom_logs/20120806: No such file or directory
>> /opt/torque/sched_logs/20120806: No matching job records located
>>
>> Job: 13.test1
>>
>> 08/06/2012 09:32:53  S    enqueuing into batch, state 1 hop 1
>> 08/06/2012 09:32:53  S    Job Queued at request of laytonjb@test1, owner
>> = laytonjb@test1, job name = pbs_test2, queue = batch
>> 08/06/2012 09:32:53  A    queue=batch
>>
>>
>>
>>
>> Scheduler Logs:
>> [root@test1 sched_logs]# more 20120806
>> 08/06/2012 09:25:54;0002; pbs_sched;Svr;Log;Log opened
>> 08/06/2012 09:25:54;0002; pbs_sched;Svr;TokenAct;Account file
>> /opt/torque/sched_priv/accounting/20120806 opened
>> 08/06/2012 09:25:54;0002; pbs_sched;Svr;main;/opt/torque/sbin/pbs_sched
>> startup pid 2090
>>
>>
>>
>>> Hi Jeff,
>>>
>>> Yes, the fact that it runs with qrun points at the scheduler.
>>> Did the pbs_sched daemon perhaps die? Does it work if you submit with default resources (specifically without ppn)? Otherwise, could you paste a tracejob and a qstat -f for the job, and raise the log_level a bit? I don't see a problem in your logs.
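>>>
>>> For example, a minimal submission along these lines (job.sh is just an example name) would let the queue defaults apply:
>>>
>>> $ cat job.sh
>>> #!/bin/bash
>>> date; hostname; sleep 20; date
>>> $ qsub -q batch job.sh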
>>>
>>> Greetings
>>> Andre
>>>
>>> ----- Original Message -----
>>>> Just an FYI - the job would run once I used qrun. Does this point
>>>> to the scheduler? (I'm just using the default scheduler that comes
>>>> with Torque, i.e. not Maui.)
>>>>
>>>> Thanks!
>>>>
>>>> Jeff
>>>>
>>>>
>>>>
>>>> Good afternoon,
>>>>
>>>> I apologize for the eternal question, "why isn't my job running"
>>>> but I'm not sure where to look next. I'm running Torque 4.0.2
>>>> that I built on a Scientific Linux 6.2 box.
>>>>
>>>> The job script is,
>>>>
>>>> #!/bin/bash
>>>> #PBS -q batch
>>>> #PBS -l walltime=00:10:00
>>>> #PBS -l nodes=1:ppn=1
>>>>
>>>> date
>>>> hostname
>>>> sleep 20
>>>> date
>>>>
>>>>
>>>> I submit using qsub and then "qstat -a" looks like,
>>>>
>>>> [laytonjb@test1 TEST]$ qstat -a
>>>>
>>>> test1:
>>>>                                                                           Req'd  Req'd   Elap
>>>> Job ID               Username    Queue    Jobname          SessID NDS TSK    Memory Time  S Time
>>>> -------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
>>>> 11.test1             laytonjb    batch    pbs_test2           --  1      1    --  00:10 Q   --
>>>>
>>>>
>>>> It stays like this forever. I looked in the logs and didn't see any
>>>> anything obvious. Here is some output that may help.
>>>>
>>>>
>>>> Server logs:
>>>>
>>>> 08/05/2012 16:15:35;0100;PBS_Server;Job;11.test1;enqueuing into
>>>> batch, state 1 hop 1
>>>> 08/05/2012 16:15:35;0008;PBS_Server;Job;11.test1;Job Queued at
>>>> request of laytonjb@test1, owner = laytonjb@test1, job name =
>>>> pbs_test2, queue = batch
>>>>
>>>>
>>>> Scheduler logs: (FIFO scheduler):
>>>>
>>>> 08/05/2012 15:44:44;0002; pbs_sched;Svr;die;caught signal 15
>>>> 08/05/2012 15:44:44;0002; pbs_sched;Svr;Log;Log closed
>>>> 08/05/2012 15:44:44;0002; pbs_sched;Svr;Log;Log opened
>>>> 08/05/2012 15:44:44;0002; pbs_sched;Svr;TokenAct;Account file
>>>> /opt/torque/sched_priv/accounting/20120805 opened
>>>> 08/05/2012 15:44:44;0002;
>>>> pbs_sched;Svr;main;/opt/torque/sbin/pbs_sched startup pid 4782
>>>>
>>>>
>>>> pbs_mom logs: (I tried restarting the mom ("service pbs_mom restart")
>>>> and the output is below)
>>>>
>>>> 08/05/2012 16:17:28;0002; pbs_mom;n/a;rm_request;shutdown
>>>> 08/05/2012 16:17:28;0002; pbs_mom;n/a;dep_cleanup;dependent cleanup
>>>> 08/05/2012 16:17:28;0002; pbs_mom;Svr;Log;Log closed
>>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;Log;Log opened
>>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version =
>>>> 4.0.2, loglevel = 0
>>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;setpbsserver;test1
>>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;mom_server_add;server test1
>>>> added
>>>> 08/05/2012 16:17:31;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::No such file
>>>> or directory (2) in check_partition_confirm_script, Couldn't stat
>>>> the partition confirm command '/opt/moab/default/tools/xt4/
>>>> partition.create.xt4.pl ' - ignore this if you aren't running a cray
>>>> 08/05/2012 16:17:31;0002; pbs_mom;n/a;initialize;independent
>>>> 08/05/2012 16:17:31;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
>>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Is up
>>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;setup_program_environment;MOM
>>>> executable path and mtime at launch: /usr/sbin/pbs_mom 1344179259
>>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version =
>>>> 4.0.2, loglevel = 0
>>>>
>>>>
>>>> pbsnodes -a:
>>>>
>>>> [root@test1 mom_logs]# pbsnodes -a
>>>> n0001
>>>> state = free
>>>> np = 1
>>>> ntype = cluster
>>>> status =
>>>> rectime=1344197869,varattr=,jobs=,state=free,netload=120587595,gres=,loadave=0.02,ncpus=3,physmem=2956668kb,availmem=2836956kb,totmem=2956668kb,idletime=4196,nusers=1,nsessions=1,sessions=1560,uname=Linux
>>>> n0001 2.6.32-220.el6.x86_64 #1 SMP Sat Dec 10 17:04:11 CST 2011
>>>> x86_64,opsys=linux
>>>> mom_service_port = 15002
>>>> mom_manager_port = 15003
>>>> gpus = 0
>>>>
>>>>
>>>>
>>>> qmgr -c "p s":
>>>> [root@test1 mom_logs]# qmgr -c "p s"
>>>> #
>>>> # Create queues and set their attributes.
>>>> #
>>>> #
>>>> # Create and define queue batch
>>>> #
>>>> create queue batch
>>>> set queue batch queue_type = Execution
>>>> set queue batch resources_default.nodes = 1
>>>> set queue batch resources_default.walltime = 01:00:00
>>>> set queue batch enabled = True
>>>> set queue batch started = True
>>>> #
>>>> # Set server attributes.
>>>> #
>>>> set server scheduling = True
>>>> set server acl_hosts = test1
>>>> set server managers = laytonjb@test1
>>>> set server operators = laytonjb@test1
>>>> set server default_queue = batch
>>>> set server log_events = 511
>>>> set server mail_from = adm
>>>> set server scheduler_iteration = 600
>>>> set server node_check_rate = 150
>>>> set server tcp_timeout = 300
>>>> set server job_stat_rate = 45
>>>> set server poll_jobs = True
>>>> set server mom_job_sync = True
>>>> set server next_job_number = 12
>>>> set server moab_array_compatible = True
>>>>
>>>>
>>>> Not sure where to start looking from here.
>>>>
>>>> TIA!
>>>>
>>>> Jeff
>>>>


