[torqueusers] jobs sit in queue forever: torque Newbie

Gus Correa gus at ldeo.columbia.edu
Tue Sep 3 14:39:37 MDT 2013


Thank you, Brian.

I surely did start trqauthd before pbs_server (server side)
and before pbs_mom (client side).
I tested 4.2.3.1 and 4.2.1, if I remember right.
It's not clear why pbs_sched didn't work, but Maui happily did
with the exact same setup (other than the scheduler).
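For the record, the startup order I followed was roughly the sketch below (paths and the way you launch the daemons will vary by install; this is just the ordering, not an init script):

```shell
# Server host: the authorization daemon must be running first,
# or client commands (qsub, qmgr) cannot authenticate with pbs_server.
trqauthd
pbs_server

# Compute node(s): trqauthd first here as well, then the MOM.
trqauthd
pbs_mom

# Scheduler last (Maui in my case; pbs_sched would go here otherwise).
maui
```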

Gus

On 09/03/2013 03:54 PM, Andrus, Brian Contractor wrote:
> Gus,
>
> I have successfully used pbs_sched with torque 4.x
> One thing I did notice: you definitely need to start the new trqauthd daemon first, or nothing works.
>
>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238
>
>
>
>
>> -----Original Message-----
>> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
>> bounces at supercluster.org] On Behalf Of Gus Correa
>> Sent: Tuesday, September 03, 2013 8:59 AM
>> To: Torque Users Mailing List
>> Subject: Re: [torqueusers] jobs sit in queue forever: torque Newbie
>>
>> Hi Erin
>>
>> If not yet done, you could try:
>>
>> # qmgr -c 'set server scheduling = True'
>>
>> Comments:
>> 1. I had no luck with pbs_sched along with Torque 4.X.Y.Z.
>> 2. Torque 4.X.Y.Z works with Maui, though.
>> 3. pbs_sched works with Torque 2.4.X and 2.5.X, which may be all you need
>> on a laptop.
>>
>> I hope this helps,
>> Gus Correa
>>
>> On 09/02/2013 05:34 PM, Hodgess, Erin wrote:
>>> Hello everyone!
>>>
>>> I have torque installed (I hope) on my Ubuntu laptop successfully but
>>> the jobs just sit in the queue forever.
>>>
>>> Here is the output from the server_log:
>>> 09/02/2013 16:23:12;0002;PBS_Server.2412;Svr;Log;Log opened
>>> 09/02/2013 16:23:12;0006;PBS_Server.2412;Svr;PBS_Server;Server
>>> localhost started, initialization type = 1
>>> 09/02/2013
>>> 16:23:12;0002;PBS_Server.2412;Svr;get_default_threads;Defaulting
>>> min_threads to 17 threads
>>> 09/02/2013 16:23:12;0002;PBS_Server.2412;Svr;Act;Account file
>>> /var/spool/torque/server_priv/accounting/20130902 opened
>>> 09/02/2013 16:23:12;0040;PBS_Server.2412;Req;setup_nodes;setup_nodes()
>>> 09/02/2013 16:23:12;0086;PBS_Server.2412;Svr;PBS_Server;Recovered
>>> queue batch
>>> 09/02/2013 16:23:12;0002;PBS_Server.2412;Svr;PBS_Server;Expected 1,
>>> recovered 1 queues
>>> 09/02/2013 16:23:12;0080;PBS_Server.2412;Svr;PBS_Server;8 total files
>>> read from disk
>>> 09/02/2013 16:23:12;0100;PBS_Server.2412;Job;3.erin-Lenovo-IdeaPad-Y480;enqueuing into batch, state 1 hop 1
>>> 09/02/2013 16:23:12;0086;PBS_Server.2412;Job;3.erin-Lenovo-IdeaPad-Y480;Requeueing job, substate: 10 Requeued in queue: batch
>>> 09/02/2013 16:23:12;0100;PBS_Server.2412;Job;5.erin-Lenovo-IdeaPad-Y480;enqueuing into batch, state 1 hop 1
>>> 09/02/2013 16:23:12;0086;PBS_Server.2412;Job;5.erin-Lenovo-IdeaPad-Y480;Requeueing job, substate: 10 Requeued in queue: batch
>>> 09/02/2013 16:23:12;0100;PBS_Server.2412;Job;6.localhost;enqueuing
>>> into batch, state 1 hop 1
>>> 09/02/2013 16:23:12;0086;PBS_Server.2412;Job;6.localhost;Requeueing
>>> job,
>>> substate: 10 Requeued in queue: batch
>>> 09/02/2013
>>> 16:23:12;0002;PBS_Server.2412;Svr;PBS_Server;handle_job_recovery:3
>>> 09/02/2013 16:23:12;0006;PBS_Server.2412;Svr;PBS_Server;Using ports
>>> Server:15001 Scheduler:15004 MOM:15002 (server: 'localhost')
>>> 09/02/2013 16:23:12;0002;PBS_Server.2412;Svr;PBS_Server;Server Ready,
>>> pid = 2412, loglevel=0
>>> 09/02/2013 16:23:12;0001;PBS_Server.2422;Svr;PBS_Server;LOG_ERROR::Operation now in progress (115) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = -2] [addr = 127.0.1.1:15003]
>>> 09/02/2013 16:23:12;0001;PBS_Server.2422;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host erin-Lenovo-IdeaPad-Y480:15003
>>> 09/02/2013
>>> 16:23:20;0001;PBS_Server.2420;Svr;PBS_Server;LOG_ERROR::svr_is_request
>>> , bad attempt to connect from 127.0.0.1:651 (address not trusted -
>>> check entry in server_priv/nodes)
>>> 09/02/2013 16:23:23;0002;PBS_Server.2421;Svr;PBS_Server;Torque Server
>>> Version = 4.2.4.1, loglevel = 0
>>> 09/02/2013 16:23:28;0100;PBS_Server.2422;Job;7.localhost;enqueuing
>>> into batch, state 1 hop 1
>>> 09/02/2013 16:23:28;0008;PBS_Server.2422;Job;req_commit;job_id:
>>> 7.localhost
>>> 09/02/2013
>>> 16:23:42;0001;PBS_Server.2421;Svr;PBS_Server;LOG_ERROR::svr_is_request
>>> , bad attempt to connect from 127.0.0.1:651 (address not trusted -
>>> check entry in server_priv/nodes)
>>> 09/02/2013 16:23:51;0080;PBS_Server.2422;Job;3.localhost;Unknown Job
>>> Id Error
>>> 09/02/2013 16:23:51;0080;PBS_Server.2422;Req;req_reject;Reject reply
>>> code=15001(Unknown Job Id Error MSG=cannot locate job), aux=0,
>>> type=DeleteJob, from root at localhost
>>> 09/02/2013 16:23:51;0008;PBS_Server.2420;Job;6.localhost;Job deleted
>>> at request of root at localhost
>>> 09/02/2013 16:23:51;000d;PBS_Server.2422;Job;6.localhost;Email 'd' to
>>> erin at localhost failed: Child process 'sendmail -f adm erin at localhost'
>>> returned 127 (errno 0:Success)
>>> 09/02/2013 16:23:56;0008;PBS_Server.2421;Job;6.localhost;on_job_exit
>>> valid pjob: 6.localhost (substate=59)
>>> 09/02/2013 16:23:56;0100;PBS_Server.2421;Job;6.localhost;dequeuing
>>> from batch, state COMPLETE
>>> 09/02/2013 16:23:57;0080;PBS_Server.2420;Job;3.localhost;Unknown Job
>>> Id Error
>>> 09/02/2013 16:23:57;0080;PBS_Server.2420;Req;req_reject;Reject reply
>>> code=15001(Unknown Job Id Error MSG=cannot locate job), aux=0,
>>> type=DeleteJob, from root at localhost
>>> 09/02/2013 16:24:02;0080;PBS_Server.2422;Job;4.localhost;Unknown Job
>>> Id Error
>>> 09/02/2013 16:24:02;0080;PBS_Server.2422;Req;req_reject;Reject reply
>>> code=15001(Unknown Job Id Error MSG=cannot locate job), aux=0,
>>> type=DeleteJob, from root at localhost
>>> 09/02/2013 16:24:06;0080;PBS_Server.2420;Job;5.localhost;Unknown Job
>>> Id Error
>>> 09/02/2013 16:24:06;0080;PBS_Server.2420;Req;req_reject;Reject reply
>>> code=15001(Unknown Job Id Error MSG=cannot locate job), aux=0,
>>> type=DeleteJob, from root at localhost
>>> 09/02/2013 16:24:10;0080;PBS_Server.2421;Job;6.localhost;Unknown Job
>>> Id Error
>>> 09/02/2013 16:24:10;0080;PBS_Server.2421;Req;req_reject;Reject reply
>>> code=15001(Unknown Job Id Error MSG=cannot locate job), aux=0,
>>> type=DeleteJob, from root at localhost
>>> 09/02/2013
>>> 16:24:27;0001;PBS_Server.2420;Svr;PBS_Server;LOG_ERROR::svr_is_request
>>> , bad attempt to connect from 127.0.0.1:651 (address not trusted -
>>> check entry in server_priv/nodes)
>>> 09/02/2013 16:24:27;0080;PBS_Server.2422;Req;req_reject;Reject reply
>>> code=15007(Unauthorized Request ), aux=0, type=RunJob, from
>>> erin at localhost
>>> 09/02/2013 16:24:32;0040;PBS_Server.2421;Req;node_spec;job allocation
>>> request exceeds currently available cluster nodes, 1 requested, 0
>>> available
>>> 09/02/2013 16:24:32;0008;PBS_Server.2421;Job;7.localhost;could not
>>> locate requested resources '1:ppn=1' (node_spec failed) job allocation
>>> request exceeds currently available cluster nodes, 1 requested, 0
>>> available
>>> 09/02/2013 16:24:32;0080;PBS_Server.2421;Req;req_reject;Reject reply
>>> code=15046(Resource temporarily unavailable MSG=job allocation request
>>> exceeds currently available cluster nodes, 1 requested, 0 available),
>>> aux=0, type=RunJob, from root at localhost
>>> 09/02/2013
>>> 16:25:12;0001;PBS_Server.2420;Svr;PBS_Server;LOG_ERROR::svr_is_request
>>> , bad attempt to connect from 127.0.0.1:651 (address not trusted -
>>> check entry in server_priv/nodes)
>>> 09/02/2013
>>> 16:25:57;0001;PBS_Server.2420;Svr;PBS_Server;LOG_ERROR::svr_is_request
>>> , bad attempt to connect from 127.0.0.1:651 (address not trusted -
>>> check entry in server_priv/nodes)
>>> 09/02/2013
>>> 16:26:42;0001;PBS_Server.2421;Svr;PBS_Server;LOG_ERROR::svr_is_request
>>> , bad attempt to connect from 127.0.0.1:651 (address not trusted -
>>> check entry in server_priv/nodes)
>>> 09/02/2013
>>> 16:27:27;0001;PBS_Server.2422;Svr;PBS_Server;LOG_ERROR::svr_is_request
>>> , bad attempt to connect from 127.0.0.1:651 (address not trusted -
>>> check entry in server_priv/nodes)
>>> 09/02/2013
>>> 16:28:12;0001;PBS_Server.2420;Svr;PBS_Server;LOG_ERROR::svr_is_request
>>> , bad attempt to connect from 127.0.0.1:651 (address not trusted -
>>> check entry in server_priv/nodes)
>>> 09/02/2013 16:28:31;0002;PBS_Server.2421;Svr;PBS_Server;Torque Server
>>> Version = 4.2.4.1, loglevel = 0
>>> 09/02/2013
>>> 16:28:57;0001;PBS_Server.2421;Svr;PBS_Server;LOG_ERROR::svr_is_request
>>> , bad attempt to connect from 127.0.0.1:651 (address not trusted -
>>> check entry in server_priv/nodes)
>>> 09/02/2013
>>> 16:29:42;0001;PBS_Server.2420;Svr;PBS_Server;LOG_ERROR::svr_is_request
>>> , bad attempt to connect from 127.0.0.1:651 (address not trusted -
>>> check entry in server_priv/nodes)
>>> 09/02/2013
>>> 16:30:27;0001;PBS_Server.2422;Svr;PBS_Server;LOG_ERROR::svr_is_request
>>> , bad attempt to connect from 127.0.0.1:651 (address not trusted -
>>> check entry in server_priv/nodes)
>>>
>>> from pbsnodes:
>>> pbsnodes
>>> erin-Lenovo-IdeaPad-Y480
>>> state = down
>>> np = 8
>>> ntype = cluster
>>> mom_service_port = 15002
>>> mom_manager_port = 15003
>>>
>>> Does any of this look familiar? Any help would be much appreciated.
>>>
>>> Sincerely,
>>> Erin
>>>
>>>
>>>
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>


