[torqueusers] jobs sit in queue forever: torque Newbie

Ken Nielson knielson at adaptivecomputing.com
Tue Sep 3 09:13:12 MDT 2013


What are you using for your scheduler? Are you using pbs_sched? Maui? If
you do not have a scheduler running, nothing will happen.

Next, make sure your queues are enabled and that scheduling is also
enabled on the server. You can check both with qmgr -c 'p s' (short for
'print server').
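
As a rough sketch of what to look for in that output, the helper below
reads the text printed by qmgr -c 'p s' on stdin and reports the settings
that would keep jobs queued. The function name check_torque_config is
made up for illustration and is not part of TORQUE; it assumes the usual
"set server ..." / "set queue ..." line format that qmgr prints.

```shell
# check_torque_config: hypothetical helper, not shipped with TORQUE.
# Reads `qmgr -c 'p s'` output on stdin and prints one line per problem.
# Prints nothing when scheduling is on and every queue is enabled and started.
check_torque_config() {
    awk '
        /^create queue /             { queues[$3] = 1 }
        /^set queue .* enabled = /   { enabled[$3] = tolower($NF) }
        /^set queue .* started = /   { started[$3] = tolower($NF) }
        /^set server scheduling = /  { sched = tolower($NF) }
        END {
            if (sched != "true") print "server: scheduling is not True"
            for (q in queues) {
                if (enabled[q] != "true") print "queue " q ": not enabled"
                if (started[q] != "true") print "queue " q ": not started"
            }
        }
    '
}

# Typical use on the server host:
#   qmgr -c 'p s' | check_torque_config
#   pgrep -l 'pbs_sched|maui'    # is a scheduler process running at all?
#
# If something is reported, the matching fixes would be along these lines
# (queue name "batch" taken from the log in the original message):
#   qmgr -c 'set server scheduling = true'
#   qmgr -c 'set queue batch enabled = true'
#   qmgr -c 'set queue batch started = true'
```

Empty output only means the server and queue settings look fine; a
scheduler still has to be running, and the node still has to show as free
rather than down in pbsnodes.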

Regards

Ken


On Mon, Sep 2, 2013 at 3:34 PM, Hodgess, Erin <HodgessE at uhd.edu> wrote:

>  Hello everyone!
>
> I have torque installed (I hope) on my Ubuntu laptop successfully but the
> jobs just sit in the queue forever.
>
> Here is the output from the server_log:
> 09/02/2013 16:23:12;0002;PBS_Server.2412;Svr;Log;Log opened
> 09/02/2013 16:23:12;0006;PBS_Server.2412;Svr;PBS_Server;Server localhost
> started, initialization type = 1
> 09/02/2013
> 16:23:12;0002;PBS_Server.2412;Svr;get_default_threads;Defaulting
> min_threads to 17 threads
> 09/02/2013 16:23:12;0002;PBS_Server.2412;Svr;Act;Account file
> /var/spool/torque/server_priv/accounting/20130902 opened
> 09/02/2013 16:23:12;0040;PBS_Server.2412;Req;setup_nodes;setup_nodes()
> 09/02/2013 16:23:12;0086;PBS_Server.2412;Svr;PBS_Server;Recovered queue
> batch
> 09/02/2013 16:23:12;0002;PBS_Server.2412;Svr;PBS_Server;Expected 1,
> recovered 1 queues
> 09/02/2013 16:23:12;0080;PBS_Server.2412;Svr;PBS_Server;8 total files read
> from disk
> 09/02/2013
> 16:23:12;0100;PBS_Server.2412;Job;3.erin-Lenovo-IdeaPad-Y480;enqueuing into
> batch, state 1 hop 1
> 09/02/2013
> 16:23:12;0086;PBS_Server.2412;Job;3.erin-Lenovo-IdeaPad-Y480;Requeueing
> job, substate: 10 Requeued in queue: batch
> 09/02/2013
> 16:23:12;0100;PBS_Server.2412;Job;5.erin-Lenovo-IdeaPad-Y480;enqueuing into
> batch, state 1 hop 1
> 09/02/2013
> 16:23:12;0086;PBS_Server.2412;Job;5.erin-Lenovo-IdeaPad-Y480;Requeueing
> job, substate: 10 Requeued in queue: batch
> 09/02/2013 16:23:12;0100;PBS_Server.2412;Job;6.localhost;enqueuing into
> batch, state 1 hop 1
> 09/02/2013 16:23:12;0086;PBS_Server.2412;Job;6.localhost;Requeueing job,
> substate: 10 Requeued in queue: batch
> 09/02/2013
> 16:23:12;0002;PBS_Server.2412;Svr;PBS_Server;handle_job_recovery:3
> 09/02/2013 16:23:12;0006;PBS_Server.2412;Svr;PBS_Server;Using ports
> Server:15001  Scheduler:15004  MOM:15002 (server: 'localhost')
> 09/02/2013 16:23:12;0002;PBS_Server.2412;Svr;PBS_Server;Server Ready, pid
> = 2412, loglevel=0
> 09/02/2013
> 16:23:12;0001;PBS_Server.2422;Svr;PBS_Server;LOG_ERROR::Operation now in
> progress (115) in tcp_connect_sockaddr, Failed when trying to open tcp
> connection - connect() failed [rc = -2] [addr = 127.0.1.1:15003]
> 09/02/2013
> 16:23:12;0001;PBS_Server.2422;Svr;PBS_Server;LOG_ERROR::send_hierarchy,
> Could not send mom hierarchy to host erin-Lenovo-IdeaPad-Y480:15003
> 09/02/2013
> 16:23:20;0001;PBS_Server.2420;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad
> attempt to connect from 127.0.0.1:651 (address not trusted - check entry
> in server_priv/nodes)
> 09/02/2013 16:23:23;0002;PBS_Server.2421;Svr;PBS_Server;Torque Server
> Version = 4.2.4.1, loglevel = 0
> 09/02/2013 16:23:28;0100;PBS_Server.2422;Job;7.localhost;enqueuing into
> batch, state 1 hop 1
> 09/02/2013 16:23:28;0008;PBS_Server.2422;Job;req_commit;job_id: 7.localhost
> 09/02/2013
> 16:23:42;0001;PBS_Server.2421;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad
> attempt to connect from 127.0.0.1:651 (address not trusted - check entry
> in server_priv/nodes)
> 09/02/2013 16:23:51;0080;PBS_Server.2422;Job;3.localhost;Unknown Job Id
> Error
> 09/02/2013 16:23:51;0080;PBS_Server.2422;Req;req_reject;Reject reply
> code=15001(Unknown Job Id Error MSG=cannot locate job), aux=0,
> type=DeleteJob, from root at localhost
> 09/02/2013 16:23:51;0008;PBS_Server.2420;Job;6.localhost;Job deleted at
> request of root at localhost
> 09/02/2013 16:23:51;000d;PBS_Server.2422;Job;6.localhost;Email 'd' to
> erin at localhost failed: Child process 'sendmail -f adm erin at localhost'
> returned 127 (errno 0:Success)
> 09/02/2013 16:23:56;0008;PBS_Server.2421;Job;6.localhost;on_job_exit valid
> pjob: 6.localhost (substate=59)
> 09/02/2013 16:23:56;0100;PBS_Server.2421;Job;6.localhost;dequeuing from
> batch, state COMPLETE
> 09/02/2013 16:23:57;0080;PBS_Server.2420;Job;3.localhost;Unknown Job Id
> Error
> 09/02/2013 16:23:57;0080;PBS_Server.2420;Req;req_reject;Reject reply
> code=15001(Unknown Job Id Error MSG=cannot locate job), aux=0,
> type=DeleteJob, from root at localhost
> 09/02/2013 16:24:02;0080;PBS_Server.2422;Job;4.localhost;Unknown Job Id
> Error
> 09/02/2013 16:24:02;0080;PBS_Server.2422;Req;req_reject;Reject reply
> code=15001(Unknown Job Id Error MSG=cannot locate job), aux=0,
> type=DeleteJob, from root at localhost
> 09/02/2013 16:24:06;0080;PBS_Server.2420;Job;5.localhost;Unknown Job Id
> Error
> 09/02/2013 16:24:06;0080;PBS_Server.2420;Req;req_reject;Reject reply
> code=15001(Unknown Job Id Error MSG=cannot locate job), aux=0,
> type=DeleteJob, from root at localhost
> 09/02/2013 16:24:10;0080;PBS_Server.2421;Job;6.localhost;Unknown Job Id
> Error
> 09/02/2013 16:24:10;0080;PBS_Server.2421;Req;req_reject;Reject reply
> code=15001(Unknown Job Id Error MSG=cannot locate job), aux=0,
> type=DeleteJob, from root at localhost
> 09/02/2013
> 16:24:27;0001;PBS_Server.2420;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad
> attempt to connect from 127.0.0.1:651 (address not trusted - check entry
> in server_priv/nodes)
> 09/02/2013 16:24:27;0080;PBS_Server.2422;Req;req_reject;Reject reply
> code=15007(Unauthorized Request ), aux=0, type=RunJob, from erin at localhost
> 09/02/2013 16:24:32;0040;PBS_Server.2421;Req;node_spec;job allocation
> request exceeds currently available cluster nodes, 1 requested, 0 available
> 09/02/2013 16:24:32;0008;PBS_Server.2421;Job;7.localhost;could not locate
> requested resources '1:ppn=1' (node_spec failed) job allocation request
> exceeds currently available cluster nodes, 1 requested, 0 available
> 09/02/2013 16:24:32;0080;PBS_Server.2421;Req;req_reject;Reject reply
> code=15046(Resource temporarily unavailable MSG=job allocation request
> exceeds currently available cluster nodes, 1 requested, 0 available),
> aux=0, type=RunJob, from root at localhost
> 09/02/2013
> 16:25:12;0001;PBS_Server.2420;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad
> attempt to connect from 127.0.0.1:651 (address not trusted - check entry
> in server_priv/nodes)
> 09/02/2013
> 16:25:57;0001;PBS_Server.2420;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad
> attempt to connect from 127.0.0.1:651 (address not trusted - check entry
> in server_priv/nodes)
> 09/02/2013
> 16:26:42;0001;PBS_Server.2421;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad
> attempt to connect from 127.0.0.1:651 (address not trusted - check entry
> in server_priv/nodes)
> 09/02/2013
> 16:27:27;0001;PBS_Server.2422;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad
> attempt to connect from 127.0.0.1:651 (address not trusted - check entry
> in server_priv/nodes)
> 09/02/2013
> 16:28:12;0001;PBS_Server.2420;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad
> attempt to connect from 127.0.0.1:651 (address not trusted - check entry
> in server_priv/nodes)
> 09/02/2013 16:28:31;0002;PBS_Server.2421;Svr;PBS_Server;Torque Server
> Version = 4.2.4.1, loglevel = 0
> 09/02/2013
> 16:28:57;0001;PBS_Server.2421;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad
> attempt to connect from 127.0.0.1:651 (address not trusted - check entry
> in server_priv/nodes)
> 09/02/2013
> 16:29:42;0001;PBS_Server.2420;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad
> attempt to connect from 127.0.0.1:651 (address not trusted - check entry
> in server_priv/nodes)
> 09/02/2013
> 16:30:27;0001;PBS_Server.2422;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad
> attempt to connect from 127.0.0.1:651 (address not trusted - check entry
> in server_priv/nodes)
>
> from pbsnodes:
> pbsnodes
> erin-Lenovo-IdeaPad-Y480
>      state = down
>      np = 8
>      ntype = cluster
>      mom_service_port = 15002
>      mom_manager_port = 15003
>
> Does any of this look familiar, please?  Any help would be much
> appreciated.
>
> Sincerely,
> Erin


-- 
Ken Nielson
+1 801.717.3700 office +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
www.adaptivecomputing.com