[torqueusers] Job not running

Gus Correa gus at ldeo.columbia.edu
Mon Aug 6 09:58:25 MDT 2012


Hi Jeff

Are you running pbs_server, pbs_mom, pbs_sched as yourself or root?
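
A quick way to check is something like

ps -eo user,args | grep pbs_

which shows the user each daemon is running as.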

Would this work?

qmgr -c 'set server managers += root@test1'
qmgr -c 'set server operators += root@test1'
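
To double-check what is currently set, this should work:

qmgr -c 'list server'

That prints all server attributes, including managers and operators
(your 'qmgr -c "p s"' output below shows them as well).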

From the Torque Admin Guide:

http://www.adaptivecomputing.com/resources/docs/torque/4-0-1/help.htm#topics/1-installConfig/installing.htm

"Adaptive Computing recommends that the TORQUE administrator be root.
For information about customizing the build at configure time, see 
Customizing the install."

"TORQUE must be installed by a root user. If running sudo fails, switch 
to root with su-."

I am not sure whether Torque strictly requires root, but the admin
guide contains several other references to commands that must be run
by root, processes that must be owned by root, etc.
An expert subscribed to this list could shed more light here.
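
Also, you asked how to raise the log level. If I remember right
(please check the admin guide for your version), the server side is

qmgr -c 'set server log_level = 7'

and for the moms you add a line

$loglevel 7

to mom_priv/config on each compute node and then restart pbs_mom.
Something like 'momctl -d 3 -h n0001' may also give useful
diagnostics from the mom.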

I hope it helps,
Gus Correa


On 08/06/2012 09:24 AM, Jeff Layton wrote:
>    André,
>
> Thanks! I tried running the job again this morning without
> the "np=1" and it still hung. I'm attaching the tracejob output
> as well as the scheduler logs. Note the mom logs are actually
> named by the compute node name. For example, the first
> compute node is called n0001 so the mom logs are called
> n0001.log. I do this since all of the mom logs are in one
> NFS directory (/opt/torque/mom_logs).
>
> I don't know how to raise the log level - can you help with
> that?
>
> Thanks!
>
> Jeff
>
>
> Tracejob output:
> [root@test1 bin]# qstat -a
>
> test1:
>                                                                          Req'd  Req'd   Elap
> Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory Time  S Time
> -------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
> 13.test1             laytonjb    batch    pbs_test2           --      1      1    --  00:10 Q   --
> [root@test1 bin]# ./tracejob 13.test1
> /opt/torque/mom_logs/20120806: No such file or directory
> /opt/torque/sched_logs/20120806: No matching job records located
>
> Job: 13.test1
>
> 08/06/2012 09:32:53  S    enqueuing into batch, state 1 hop 1
> 08/06/2012 09:32:53  S    Job Queued at request of laytonjb@test1, owner
> = laytonjb@test1, job name = pbs_test2, queue = batch
> 08/06/2012 09:32:53  A    queue=batch
>
>
>
>
> Scheduler Logs:
> [root@test1 sched_logs]# more 20120806
> 08/06/2012 09:25:54;0002; pbs_sched;Svr;Log;Log opened
> 08/06/2012 09:25:54;0002; pbs_sched;Svr;TokenAct;Account file
> /opt/torque/sched_priv/accounting/20120806 opened
> 08/06/2012 09:25:54;0002; pbs_sched;Svr;main;/opt/torque/sbin/pbs_sched
> startup pid 2090
>
>
>
>> Hi Jeff,
>>
>> yes, the fact that it runs with qrun points at the scheduler.
>> Did the pbs_sched daemon perhaps die? Does it work if you submit with
>> default resources (specifically without ppn)? Otherwise, could you
>> paste a tracejob and "qstat -f" output, and raise the log_level a bit?
>> I don't see a problem in your logs.
>>
>> Greetings
>> Andre
>>
>> ----- Original Message -----
>>> Just an FYI - the job would run once I used qrun. Does this point
>>> to the scheduler? (I'm just using the default scheduler that comes
>>> with Torque, i.e. not Maui.)
>>>
>>> Thanks!
>>>
>>> Jeff
>>>
>>>
>>>
>>> Good afternoon,
>>>
>>> I apologize for the eternal question, "why isn't my job running"
>>> but I'm not sure where to look next. I'm running Torque 4.0.2
>>> that I built on a Scientific Linux 6.2 box.
>>>
>>> The job script is,
>>>
>>> #!/bin/bash
>>> #PBS -q batch
>>> #PBS -l walltime=00:10:00
>>> #PBS -l nodes=1:ppn=1
>>>
>>> date
>>> hostname
>>> sleep 20
>>> date
>>>
>>>
>>> I submit using qsub and then "qstat -a" looks like,
>>>
>>> [laytonjb@test1 TEST]$ qstat -a
>>>
>>> test1:
>>>                                                                          Req'd  Req'd   Elap
>>> Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory Time  S Time
>>> -------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
>>> 11.test1             laytonjb    batch    pbs_test2           --      1      1    --  00:10 Q   --
>>>
>>>
>>> It stays like this forever. I looked in the logs and didn't see
>>> anything obvious. Here is some output that may help.
>>>
>>>
>>> Server logs:
>>>
>>> 08/05/2012 16:15:35;0100;PBS_Server;Job;11.test1;enqueuing into
>>> batch, state 1 hop 1
>>> 08/05/2012 16:15:35;0008;PBS_Server;Job;11.test1;Job Queued at
>>> request of laytonjb@test1, owner = laytonjb@test1, job name =
>>> pbs_test2, queue = batch
>>>
>>>
>>> Scheduler logs: (FIFO scheduler):
>>>
>>> 08/05/2012 15:44:44;0002; pbs_sched;Svr;die;caught signal 15
>>> 08/05/2012 15:44:44;0002; pbs_sched;Svr;Log;Log closed
>>> 08/05/2012 15:44:44;0002; pbs_sched;Svr;Log;Log opened
>>> 08/05/2012 15:44:44;0002; pbs_sched;Svr;TokenAct;Account file
>>> /opt/torque/sched_priv/accounting/20120805 opened
>>> 08/05/2012 15:44:44;0002; pbs_sched;Svr;main;/opt/torque/sbin/pbs_sched
>>> startup pid 4782
>>>
>>>
>>> pbs_mom logs: (I tried restarting the mom ("service pbs_mom restart")
>>> and the output is below)
>>>
>>> 08/05/2012 16:17:28;0002; pbs_mom;n/a;rm_request;shutdown
>>> 08/05/2012 16:17:28;0002; pbs_mom;n/a;dep_cleanup;dependent cleanup
>>> 08/05/2012 16:17:28;0002; pbs_mom;Svr;Log;Log closed
>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;Log;Log opened
>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version =
>>> 4.0.2, loglevel = 0
>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;setpbsserver;test1
>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;mom_server_add;server test1
>>> added
>>> 08/05/2012 16:17:31;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::No such file
>>> or directory (2) in check_partition_confirm_script, Couldn't stat
>>> the partition confirm command
>>> '/opt/moab/default/tools/xt4/partition.create.xt4.pl' - ignore this
>>> if you aren't running a cray
>>> 08/05/2012 16:17:31;0002; pbs_mom;n/a;initialize;independent
>>> 08/05/2012 16:17:31;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Is up
>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;setup_program_environment;MOM
>>> executable path and mtime at launch: /usr/sbin/pbs_mom 1344179259
>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version =
>>> 4.0.2, loglevel = 0
>>>
>>>
>>> pbsnodes -a:
>>>
>>> [root@test1 mom_logs]# pbsnodes -a
>>> n0001
>>> state = free
>>> np = 1
>>> ntype = cluster
>>> status =
>>> rectime=1344197869,varattr=,jobs=,state=free,netload=120587595,gres=,loadave=0.02,ncpus=3,physmem=2956668kb,availmem=2836956kb,totmem=2956668kb,idletime=4196,nusers=1,nsessions=1,sessions=1560,uname=Linux
>>> n0001 2.6.32-220.el6.x86_64 #1 SMP Sat Dec 10 17:04:11 CST 2011
>>> x86_64,opsys=linux
>>> mom_service_port = 15002
>>> mom_manager_port = 15003
>>> gpus = 0
>>>
>>>
>>>
>>> qmgr -c "p s":
>>> [root@test1 mom_logs]# qmgr -c "p s"
>>> #
>>> # Create queues and set their attributes.
>>> #
>>> #
>>> # Create and define queue batch
>>> #
>>> create queue batch
>>> set queue batch queue_type = Execution
>>> set queue batch resources_default.nodes = 1
>>> set queue batch resources_default.walltime = 01:00:00
>>> set queue batch enabled = True
>>> set queue batch started = True
>>> #
>>> # Set server attributes.
>>> #
>>> set server scheduling = True
>>> set server acl_hosts = test1
>>> set server managers = laytonjb@test1
>>> set server operators = laytonjb@test1
>>> set server default_queue = batch
>>> set server log_events = 511
>>> set server mail_from = adm
>>> set server scheduler_iteration = 600
>>> set server node_check_rate = 150
>>> set server tcp_timeout = 300
>>> set server job_stat_rate = 45
>>> set server poll_jobs = True
>>> set server mom_job_sync = True
>>> set server next_job_number = 12
>>> set server moab_array_compatible = True
>>>
>>>
>>> Not sure where to start looking from here.
>>>
>>> TIA!
>>>
>>> Jeff
>>>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


