[torqueusers] error in pbs_iff: cannot read reply from pbs_server
Guilherme Menegon Arantes
garantes at iq.usp.br
Wed Sep 17 12:38:38 MDT 2008
Dear Torque users,
My Torque installation works fine, but when I submit a large number
of jobs in a row (say, more than 10 or 15), I get the following error
message:
pbs_iff: cannot read reply from pbs_server
No Permission.
qsub: cannot connect to server node5 (errno=15007)
where node5 is my Torque server. The error appears with qsub, qstat,
and pbsnodes alike, every time a large number of jobs is submitted.
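For reference, the kind of burst that triggers the error can be sketched as a plain shell loop. This is only an illustration: the function name submit_burst, the SUBMIT_CMD override, and the job script name are hypothetical, and the only Torque command it relies on is qsub.

```sh
#!/bin/sh
# Sketch: submit COUNT copies of a job back to back and report how many
# submissions failed. SUBMIT_CMD is a hypothetical override (default: qsub).
SUBMIT_CMD="${SUBMIT_CMD:-qsub}"

submit_burst() {
    count="$1"
    job_script="$2"
    failed=0
    i=1
    while [ "$i" -le "$count" ]; do
        # No pause between submissions; the errors show up in bursts like this.
        if ! "$SUBMIT_CMD" "$job_script" >/dev/null 2>&1; then
            failed=$((failed + 1))
        fi
        i=$((i + 1))
    done
    echo "$failed of $count submissions failed"
}
```

Calling submit_burst 20 test.sh (with test.sh standing in for any trivial job script) prints a one-line summary of how many of the 20 qsub calls returned an error.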
Checking the server logs, I see errors like:
09/17/2008 09:58:33;0080;PBS_Server;Req;req_reject;Reject reply code=15019(Invalid credential MSG=cannot authenticate user), aux=0, type=AuthenticateUser, from garantes@node5.full_server_name
where the server's full domain name was not copied here, but is shown
in the logs. I am running Torque 2.3.0, and the error occurs with
either the default pbs_sched or Maui (3.2.6p19) running as the scheduler.
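To check how often the server rejects authentication requests, the matching log lines can be counted. The sketch below assumes only the "Reject reply code=15019" text from the log entry above; the function name count_auth_rejects is made up for illustration, and the log file path must be supplied by the caller.

```sh
#!/bin/sh
# Sketch: count authentication rejections (reply code 15019) in a server log.
# The log file is passed as the first argument; no install layout is assumed.
count_auth_rejects() {
    # grep -c prints the number of matching lines. It exits with status 1
    # when there are no matches, so "|| true" keeps the function's status 0.
    grep -c 'Reject reply code=15019' "$1" || true
}
```

For example, count_auth_rejects /var/spool/torque/server_logs/20080917 would print the number of rejections logged that day, assuming the server logs live under that directory (a common default, but it may differ on a given installation).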
The output of qmgr -c "print server" follows:
#
# Create and define queue normal
#
create queue normal
set queue normal queue_type = Execution
set queue normal max_queuable = 50
set queue normal max_user_queuable = 50
set queue normal acl_host_enable = False
set queue normal acl_hosts = node6
set queue normal acl_hosts += node4
set queue normal acl_hosts += node3
set queue normal acl_hosts += node2
set queue normal acl_hosts += node1
set queue normal resources_default.nodes = 1
set queue normal resources_default.walltime = 240:00:00
set queue normal acl_group_enable = True
set queue normal acl_groups = users
set queue normal enabled = True
set queue normal started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = node5
set server managers = garantes@*.full_server_name
set server managers += garantes@full_server_name
set server operators = garantes@*.full_server_name
set server operators += garantes@full_server_name
set server default_queue = normal
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 4
set server next_job_number = 992
Again, the server's full domain name was not copied here.
Any clues as to what is going on? Are there any server parameters I
should change to avoid this error?
I have searched the archives at clusterresources.com without much success,
so any help is appreciated.
Regards,
G
--
Guilherme Menegon Arantes, PhD São Paulo, Brasil