[torqueusers] jobs sit in queue forever: torque Newbie

Andrus, Brian Contractor bdandrus at nps.edu
Tue Sep 3 14:14:19 MDT 2013


service trqauthd start
or
/etc/init.d/trqauthd start
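
Once the daemon is started, it may be worth confirming that the compute node is no longer reported as down, since the pbsnodes output quoted below shows "state = down". A minimal sketch, assuming pbsnodes prints each node name unindented with its attributes indented beneath it; the helper name down_nodes is made up for illustration and is not part of Torque:

```shell
# Hypothetical helper (not a Torque command): read pbsnodes-style text on
# stdin and print the name of every node whose state line mentions "down".
# Assumes node names start in column 1 and attribute lines are indented.
down_nodes() {
    awk '
        /^[^ \t]/ { node = $1 }                         # unindented line: remember the node name
        /^[ \t]+state[ \t]*=/ && /down/ { print node }  # indented state line containing "down"
    '
}

# Typical use on the server host:
#   pbsnodes -a | down_nodes
```

Starting the authorization daemon by itself may not clear a down state. If the node is still listed, the "address not trusted" errors in the quoted server_log suggest server_priv/nodes as the next place to look.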


Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238




> -----Original Message-----
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
> bounces at supercluster.org] On Behalf Of Hodgess, Erin
> Sent: Tuesday, September 03, 2013 12:58 PM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] jobs sit in queue forever: torque Newbie
> 
> Dumb question: how do you start trqauthd, please?
> 
> Thanks,
> Erin
> 
> 
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
> bounces at supercluster.org] On Behalf Of Andrus, Brian Contractor
> Sent: Tuesday, September 03, 2013 2:54 PM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] jobs sit in queue forever: torque Newbie
> 
> Gus,
> 
> I have successfully used pbs_sched with Torque 4.x. One thing I did notice
> was that you definitely need to start the new trqauthd daemon, or nothing
> works.
> 
> 
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238
> 
> 
> 
> 
> > -----Original Message-----
> > From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
> > bounces at supercluster.org] On Behalf Of Gus Correa
> > Sent: Tuesday, September 03, 2013 8:59 AM
> > To: Torque Users Mailing List
> > Subject: Re: [torqueusers] jobs sit in queue forever: torque Newbie
> >
> > Hi Erin
> >
> > If not yet done, you could try:
> >
> > # qmgr -c 'set server scheduling = True'
> >
> > Comments:
> > 1. I had no luck with pbs_sched along with Torque 4.X.Y.Z.
> > 2. Torque 4.X.Y.Z works with Maui, though.
> > 3. pbs_sched works with Torque 2.4.X and 2.5.X, which may be all you
> > need on a laptop.
> >
> > I hope this helps,
> > Gus Correa
> >
> > On 09/02/2013 05:34 PM, Hodgess, Erin wrote:
> > > Hello everyone!
> > >
> > > I have torque installed (I hope) on my Ubuntu laptop successfully
> > > but the jobs just sit in the queue forever.
> > >
> > > Here is the output from the server_log:
> > > > 09/02/2013 16:23:12;0002;PBS_Server.2412;Svr;Log;Log opened
> > > > 09/02/2013 16:23:12;0006;PBS_Server.2412;Svr;PBS_Server;Server localhost started, initialization type = 1
> > > > 09/02/2013 16:23:12;0002;PBS_Server.2412;Svr;get_default_threads;Defaulting min_threads to 17 threads
> > > > 09/02/2013 16:23:12;0002;PBS_Server.2412;Svr;Act;Account file /var/spool/torque/server_priv/accounting/20130902 opened
> > > > 09/02/2013 16:23:12;0040;PBS_Server.2412;Req;setup_nodes;setup_nodes()
> > > > 09/02/2013 16:23:12;0086;PBS_Server.2412;Svr;PBS_Server;Recovered queue batch
> > > > 09/02/2013 16:23:12;0002;PBS_Server.2412;Svr;PBS_Server;Expected 1, recovered 1 queues
> > > > 09/02/2013 16:23:12;0080;PBS_Server.2412;Svr;PBS_Server;8 total files read from disk
> > > > 09/02/2013 16:23:12;0100;PBS_Server.2412;Job;3.erin-Lenovo-IdeaPad-Y480;enqueuing into batch, state 1 hop 1
> > > > 09/02/2013 16:23:12;0086;PBS_Server.2412;Job;3.erin-Lenovo-IdeaPad-Y480;Requeueing job, substate: 10 Requeued in queue: batch
> > > > 09/02/2013 16:23:12;0100;PBS_Server.2412;Job;5.erin-Lenovo-IdeaPad-Y480;enqueuing into batch, state 1 hop 1
> > > > 09/02/2013 16:23:12;0086;PBS_Server.2412;Job;5.erin-Lenovo-IdeaPad-Y480;Requeueing job, substate: 10 Requeued in queue: batch
> > > > 09/02/2013 16:23:12;0100;PBS_Server.2412;Job;6.localhost;enqueuing into batch, state 1 hop 1
> > > > 09/02/2013 16:23:12;0086;PBS_Server.2412;Job;6.localhost;Requeueing job, substate: 10 Requeued in queue: batch
> > > > 09/02/2013 16:23:12;0002;PBS_Server.2412;Svr;PBS_Server;handle_job_recovery:3
> > > > 09/02/2013 16:23:12;0006;PBS_Server.2412;Svr;PBS_Server;Using ports Server:15001 Scheduler:15004 MOM:15002 (server: 'localhost')
> > > > 09/02/2013 16:23:12;0002;PBS_Server.2412;Svr;PBS_Server;Server Ready, pid = 2412, loglevel=0
> > > 09/02/2013
> > > 16:23:12;0001;PBS_Server.2422;Svr;PBS_Server;LOG_ERROR::Operation
> > now
> > > in progress (115) in tcp_connect_sockaddr, Failed when trying to
> > > open tcp connection - connect() failed [rc = -2] [addr =
> > > 127.0.1.1:15003]
> > > 09/02/2013
> > >
> >
> 16:23:12;0001;PBS_Server.2422;Svr;PBS_Server;LOG_ERROR::send_hierarchy
> > > , Could not send mom hierarchy to host
> > > erin-Lenovo-IdeaPad-Y480:15003
> > > 09/02/2013
> > > 16:23:20;0001;PBS_Server.2420;Svr;PBS_Server;LOG_ERROR::svr_is_reque
> > > st , bad attempt to connect from 127.0.0.1:651 (address not trusted
> > > - check entry in server_priv/nodes)
> > > 09/02/2013 16:23:23;0002;PBS_Server.2421;Svr;PBS_Server;Torque
> > > Server Version = 4.2.4.1, loglevel = 0
> > > 09/02/2013 16:23:28;0100;PBS_Server.2422;Job;7.localhost;enqueuing
> > > into batch, state 1 hop 1
> > > 09/02/2013 16:23:28;0008;PBS_Server.2422;Job;req_commit;job_id:
> > > 7.localhost
> > > 09/02/2013
> > > 16:23:42;0001;PBS_Server.2421;Svr;PBS_Server;LOG_ERROR::svr_is_reque
> > > st , bad attempt to connect from 127.0.0.1:651 (address not trusted
> > > - check entry in server_priv/nodes)
> > > 09/02/2013 16:23:51;0080;PBS_Server.2422;Job;3.localhost;Unknown Job
> > > Id Error
> > > 09/02/2013 16:23:51;0080;PBS_Server.2422;Req;req_reject;Reject reply
> > > code=15001(Unknown Job Id Error MSG=cannot locate job), aux=0,
> > > type=DeleteJob, from root at localhost
> > > 09/02/2013 16:23:51;0008;PBS_Server.2420;Job;6.localhost;Job deleted
> > > at request of root at localhost
> > > 09/02/2013 16:23:51;000d;PBS_Server.2422;Job;6.localhost;Email 'd'
> > > to erin at localhost failed: Child process 'sendmail -f adm erin at localhost'
> > > returned 127 (errno 0:Success)
> > > 09/02/2013 16:23:56;0008;PBS_Server.2421;Job;6.localhost;on_job_exit
> > > valid pjob: 6.localhost (substate=59)
> > > 09/02/2013 16:23:56;0100;PBS_Server.2421;Job;6.localhost;dequeuing
> > > from batch, state COMPLETE
> > > 09/02/2013 16:23:57;0080;PBS_Server.2420;Job;3.localhost;Unknown Job
> > > Id Error
> > > 09/02/2013 16:23:57;0080;PBS_Server.2420;Req;req_reject;Reject reply
> > > code=15001(Unknown Job Id Error MSG=cannot locate job), aux=0,
> > > type=DeleteJob, from root at localhost
> > > 09/02/2013 16:24:02;0080;PBS_Server.2422;Job;4.localhost;Unknown Job
> > > Id Error
> > > 09/02/2013 16:24:02;0080;PBS_Server.2422;Req;req_reject;Reject reply
> > > code=15001(Unknown Job Id Error MSG=cannot locate job), aux=0,
> > > type=DeleteJob, from root at localhost
> > > 09/02/2013 16:24:06;0080;PBS_Server.2420;Job;5.localhost;Unknown Job
> > > Id Error
> > > 09/02/2013 16:24:06;0080;PBS_Server.2420;Req;req_reject;Reject reply
> > > code=15001(Unknown Job Id Error MSG=cannot locate job), aux=0,
> > > type=DeleteJob, from root at localhost
> > > 09/02/2013 16:24:10;0080;PBS_Server.2421;Job;6.localhost;Unknown Job
> > > Id Error
> > > 09/02/2013 16:24:10;0080;PBS_Server.2421;Req;req_reject;Reject reply
> > > code=15001(Unknown Job Id Error MSG=cannot locate job), aux=0,
> > > type=DeleteJob, from root at localhost
> > > 09/02/2013
> > > 16:24:27;0001;PBS_Server.2420;Svr;PBS_Server;LOG_ERROR::svr_is_reque
> > > st , bad attempt to connect from 127.0.0.1:651 (address not trusted
> > > - check entry in server_priv/nodes)
> > > 09/02/2013 16:24:27;0080;PBS_Server.2422;Req;req_reject;Reject reply
> > > code=15007(Unauthorized Request ), aux=0, type=RunJob, from
> > > erin at localhost
> > > 09/02/2013 16:24:32;0040;PBS_Server.2421;Req;node_spec;job
> > > allocation request exceeds currently available cluster nodes, 1
> > > requested, 0 available
> > > 09/02/2013 16:24:32;0008;PBS_Server.2421;Job;7.localhost;could not
> > > locate requested resources '1:ppn=1' (node_spec failed) job
> > > allocation request exceeds currently available cluster nodes, 1
> > > requested, 0 available
> > > 09/02/2013 16:24:32;0080;PBS_Server.2421;Req;req_reject;Reject reply
> > > code=15046(Resource temporarily unavailable MSG=job allocation
> > > request exceeds currently available cluster nodes, 1 requested, 0
> > > available), aux=0, type=RunJob, from root at localhost
> > > 09/02/2013
> > > 16:25:12;0001;PBS_Server.2420;Svr;PBS_Server;LOG_ERROR::svr_is_reque
> > > st , bad attempt to connect from 127.0.0.1:651 (address not trusted
> > > - check entry in server_priv/nodes)
> > > 09/02/2013
> > > 16:25:57;0001;PBS_Server.2420;Svr;PBS_Server;LOG_ERROR::svr_is_reque
> > > st , bad attempt to connect from 127.0.0.1:651 (address not trusted
> > > - check entry in server_priv/nodes)
> > > 09/02/2013
> > > 16:26:42;0001;PBS_Server.2421;Svr;PBS_Server;LOG_ERROR::svr_is_reque
> > > st , bad attempt to connect from 127.0.0.1:651 (address not trusted
> > > - check entry in server_priv/nodes)
> > > 09/02/2013
> > > 16:27:27;0001;PBS_Server.2422;Svr;PBS_Server;LOG_ERROR::svr_is_reque
> > > st , bad attempt to connect from 127.0.0.1:651 (address not trusted
> > > - check entry in server_priv/nodes)
> > > 09/02/2013
> > > 16:28:12;0001;PBS_Server.2420;Svr;PBS_Server;LOG_ERROR::svr_is_reque
> > > st , bad attempt to connect from 127.0.0.1:651 (address not trusted
> > > - check entry in server_priv/nodes)
> > > 09/02/2013 16:28:31;0002;PBS_Server.2421;Svr;PBS_Server;Torque
> > > Server Version = 4.2.4.1, loglevel = 0
> > > 09/02/2013
> > > 16:28:57;0001;PBS_Server.2421;Svr;PBS_Server;LOG_ERROR::svr_is_reque
> > > st , bad attempt to connect from 127.0.0.1:651 (address not trusted
> > > - check entry in server_priv/nodes)
> > > 09/02/2013
> > > 16:29:42;0001;PBS_Server.2420;Svr;PBS_Server;LOG_ERROR::svr_is_reque
> > > st , bad attempt to connect from 127.0.0.1:651 (address not trusted
> > > - check entry in server_priv/nodes)
> > > 09/02/2013
> > > 16:30:27;0001;PBS_Server.2422;Svr;PBS_Server;LOG_ERROR::svr_is_reque
> > > st , bad attempt to connect from 127.0.0.1:651 (address not trusted
> > > - check entry in server_priv/nodes)
> > >
> > > from pbsnodes:
> > > pbsnodes
> > > erin-Lenovo-IdeaPad-Y480
> > > >      state = down
> > > >      np = 8
> > > >      ntype = cluster
> > > >      mom_service_port = 15002
> > > >      mom_manager_port = 15003
> > >
> > > > Does any of this look familiar, please? Any help would be much
> > > > appreciated.
> > >
> > > Sincerely,
> > > Erin
> > >
> > >
> > >
> > > _______________________________________________
> > > torqueusers mailing list
> > > torqueusers at supercluster.org
> > > http://www.supercluster.org/mailman/listinfo/torqueusers
> >

