[torqueusers] request type QueueJob from host hulk rejected (host not authorized)
Thomas Fischer
tfischer at dc.uba.ar
Wed Mar 7 11:33:27 MST 2012
Hi all,
I am building a new cluster running Debian lenny, and I decided to
switch to Torque.
So far I have only managed a first install of Torque (version 2.4.8
from the lenny-backports repo) and Maui (3.3.1 from source) on the server
(called hulk), and of torque-mom on one execution node (called nodo-32).
I followed the guide on debianclusters.org to do so.
Everything seemed to be working (services are running, etc.), but when
I submit a test job (echo "sleep 30" | qsub) as a regular user, the job
is queued and then deferred by Maui. Here is what I consider the
relevant output:
--------------------------------------------------
tfischer@hulk:~$ echo "sleep 30" | qsub
13.hulk
--------------------------------------------------
tfischer@hulk:~$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
13.hulk                   STDIN            tfischer               0 Q main.queue
--------------------------------------------------
root@hulk:~# qrun -H nodo-32 13
qrun: Execution server rejected request MSG=cannot send job to mom,
state=PRERUN 13.hulk
--------------------------------------------------
tfischer@hulk:~$ /usr/local/maui/bin/checkjob 13
checking job 13
State: Idle EState: Deferred
Creds: user:tfischer group:tfischer class:main.queue qos:DEFAULT
WallTime: 00:00:00 of 1:00:00
SubmitTime: Wed Mar 7 15:08:05
(Time Queued Total: 00:15:13 Eligible: 00:00:02)
StartDate: -00:15:10 Wed Mar 7 15:08:08
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE
job is deferred. Reason: RMFailure (cannot start job - RM failure,
rc: 15041, msg: 'Execution server rejected request MSG=cannot send job
to mom, state=PRERUN')
Holds: Defer (hold reason: RMFailure)
PE: 1.00 StartPriority: 1
cannot select job 13 for partition DEFAULT (job hold active)
--------------------------------------------------
root@hulk:~# pbsnodes -a
nodo-32
     state = free
     np = 16
     ntype = cluster
     status = opsys=linux,uname=Linux nodo-32 2.6.26.x3550m3 #1 SMP Mon Jan 23 11:51:03 ART 2012 x86_64,sessions=5677,nsessions=1,nusers=1,idletime=164255,totmem=24817844kb,availmem=24725496kb,physmem=16431924kb,ncpus=16,loadave=0.00,netload=26551478,state=free,jobs=,varattr=,rectime=1331144485

nodo-33
     state = down
     np = 1
     ntype = cluster
--------------------------------------------------
from hulk:/var/spool/torque/server_logs/20120307
hulk PBS_Server: LOG_ERROR::Access from host not allowed, or unknown
host (15008) in send_job, child failed in previous commit request for
job 13.hulk
--------------------------------------------------
from nodo-32:/var/spool/torque/mom_logs/20120307
pbs_mom;Req;req_reject;Reject reply code=15008(Access from host not
allowed, or unknown host MSG=request not authorized), aux=0,
type=QueueJob, from PBS_Server@hulk
--------------------------------------------------
It seems the node is rejecting jobs from the server. The server name
is defined on the node as follows:
nodo-32:~# cat /var/spool/torque/server_name
hulk
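
Since error 15008 is about the host not being allowed or known, one
check that may be worth making is which address each machine
associates with the name hulk. The following is only a sketch: the
helper hosts_lookup is hypothetical (it is not a Torque tool), and it
assumes names are resolved through /etc/hosts rather than DNS.

```shell
#!/bin/sh
# Hypothetical helper, not part of Torque: print every address that a
# hosts-format file maps to the given name.
# Usage: hosts_lookup NAME [FILE]   (FILE defaults to /etc/hosts)
hosts_lookup() {
    name=$1
    file=${2:-/etc/hosts}
    awk -v name="$name" '
        { sub(/#.*/, "") }                 # drop trailing comments
        {
            # field 1 is the address, fields 2..NF are names and aliases
            for (i = 2; i <= NF; i++)
                if ($i == name) { print $1; break }
        }
    ' "$file"
}
```

Running "hosts_lookup hulk" on both hulk and nodo-32 and comparing the
results would show whether the two machines agree on the server's
address; more than one line of output, or a loopback address, would be
something to look at more closely.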
Is there something I am forgetting or misconfiguring?
Thanks in advance,
Thomas Fischer
--
restate my assumptions:
1. Mathematics is the language of nature.
2. Everything around us can be represented and understood through numbers.
3. If you graph these numbers, patterns emerge. Therefore: There are
patterns everywhere in nature.
Max Cohen, PI