[torqueusers] request type QueueJob from host hulk rejected (host not authorized)

Thomas Fischer tfischer at dc.uba.ar
Wed Mar 7 11:33:27 MST 2012


Hi all,

I am building a new cluster running Debian lenny, and I decided to
switch to torque.
So far I have only managed a first install of torque (version 2.4.8
from the lenny-backports repo) and Maui (3.3.1 from source) on the
server (called hulk), and torque-mom on one execution node (called
nodo-32). I followed the guide on debianclusters.org to do so.
Everything seemed to be working (services are running, etc.), but when
I submit a test job (echo "sleep 30" | qsub) as a regular user, the
job is queued and then deferred by Maui. Here are the outputs I
consider relevant:

--------------------------------------------------

tfischer at hulk:~$ echo "sleep 30" | qsub
13.hulk

--------------------------------------------------

tfischer at hulk:~$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
13.hulk                   STDIN            tfischer               0 Q main.queue

--------------------------------------------------

root at hulk:~# qrun -H nodo-32 13
qrun: Execution server rejected request MSG=cannot send job to mom,
state=PRERUN 13.hulk

--------------------------------------------------

tfischer at hulk:~$ /usr/local/maui/bin/checkjob 13

checking job 13

State: Idle  EState: Deferred
Creds:  user:tfischer  group:tfischer  class:main.queue  qos:DEFAULT
WallTime: 00:00:00 of 1:00:00
SubmitTime: Wed Mar  7 15:08:05
  (Time Queued  Total: 00:15:13  Eligible: 00:00:02)

StartDate: -00:15:10  Wed Mar  7 15:08:08
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE

job is deferred.  Reason:  RMFailure  (cannot start job - RM failure,
rc: 15041, msg: 'Execution server rejected request MSG=cannot send job
to mom, state=PRERUN')
Holds:    Defer  (hold reason:  RMFailure)
PE:  1.00  StartPriority:  1
cannot select job 13 for partition DEFAULT (job hold active)

--------------------------------------------------

root at hulk:~# pbsnodes -a
nodo-32
     state = free
     np = 16
     ntype = cluster
     status = opsys=linux,uname=Linux nodo-32 2.6.26.x3550m3 #1 SMP
Mon Jan 23 11:51:03 ART 2012
x86_64,sessions=5677,nsessions=1,nusers=1,idletime=164255,totmem=24817844kb,availmem=24725496kb,physmem=16431924kb,ncpus=16,loadave=0.00,netload=26551478,state=free,jobs=,varattr=,rectime=1331144485

nodo-33
     state = down
     np = 1
     ntype = cluster

--------------------------------------------------

from hulk:/var/spool/torque/server_logs/20120307
hulk PBS_Server: LOG_ERROR::Access from host not allowed, or unknown
host (15008) in send_job, child failed in previous commit request for
job 13.hulk

--------------------------------------------------

from nodo-32:/var/spool/torque/mom_logs/20120307
pbs_mom;Req;req_reject;Reject reply code=15008(Access from host not
allowed, or unknown host MSG=request not authorized), aux=0,
type=QueueJob, from PBS_Server at hulk

--------------------------------------------------

It seems that the node is rejecting jobs from the server, even though
the server name is defined on the node:

nodo-32:~# cat /var/spool/torque/server_name
hulk
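
One thing I can still check is whether the name hulk resolves the same
way in both directions on the node, since the mom log says "unknown
host". I have not confirmed that name resolution is the cause; this is
only a minimal sketch, assuming getent and awk are available.
check_host is a helper name made up for this example, not part of
torque.

```shell
# Sketch: does a host name resolve to an address, and does that address
# resolve back to the same name? check_host is a hypothetical helper.
check_host() {
    name="$1"
    # Forward lookup: first address the resolver returns for the name.
    addr=$(getent hosts "$name" | awk 'NR==1 {print $1}')
    if [ -z "$addr" ]; then
        echo "$name: no address found"
        return 1
    fi
    # Reverse lookup: every name the resolver returns for that address.
    names=$(getent hosts "$addr" | awk 'NR==1 {for (i = 2; i <= NF; i++) print $i}')
    echo "$name -> $addr -> $(echo $names)"
    for n in $names; do
        [ "$n" = "$name" ] && return 0
    done
    echo "$name: reverse lookup of $addr does not return $name"
    return 1
}
```

The idea would be to run "check_host hulk" on nodo-32 and
"check_host nodo-32" on hulk, and compare the addresses with the ones
the two machines actually use to talk to each other.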

Is there something I am forgetting or misconfiguring?

Thanks in advance,

Thomas Fischer

-- 
restate my assumptions:
1. Mathematics is the language of nature.
2. Everything around us can be represented and understood through numbers.
3. If you graph these numbers, patterns emerge. Therefore: There are
patterns everywhere in nature.

Max Cohen, PI

