[torqueusers] Job hang on newly Torque setup. Possible errors
with hostname case sensitive. (CentOS - Rocks 5)
Greenseid, Joseph M.
Joseph.Greenseid at ngc.com
Mon Nov 3 08:49:37 MST 2008
It looks like torque thinks your server name is jupiter.local, not jupiter.mynetwork.com.
Is the server name defined in a file in the pbs spool directory (/var/spool/pbs or something like that)? If so, what happens if you change it in there?
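A minimal sketch of that check, in Python, in case it helps: it reads the name stored in a server-name file and compares it with another host name while ignoring case. The function names are made up for illustration, and the file path is something you have to supply yourself; I am not assuming where your install keeps it.

```python
from pathlib import Path


def read_server_name(path):
    """Return the first non-empty line of the given file, stripped of whitespace."""
    for line in Path(path).read_text().splitlines():
        if line.strip():
            return line.strip()
    return ""


def same_host(a, b):
    """Compare two host names ignoring case and a trailing dot."""
    return a.rstrip(".").lower() == b.rstrip(".").lower()
```

For example, same_host(read_server_name(path), "jupiter.mynetwork.com") tells you whether the stored name is the external one, where path is wherever the server name file lives on your system.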
--Joe
________________________________
From: torqueusers-bounces at supercluster.org on behalf of Steven Truong
Sent: Sat 11/1/2008 2:36 AM
To: torqueusers at supercluster.org
Subject: [torqueusers] Job hang on newly Torque setup. Possible errors with hostname case sensitive. (CentOS - Rocks 5)
I learned my lesson the hard way and set up my head node, where the torque
server is running. The FQDN is Jupiter.mynetwork.com, and the
following are the entries related to the head node:
/etc/hosts
10.1.1.1 Jupiter.local Jupiter # originally frontend-0-0
192.168.0.181 Jupiter.mynetwork.com
/etc/sysconfig/network
....
HOSTNAME=Jupiter.mynetwork.com
-----------
A test user submitted a job and the job got queued. In torque's log I
found "(No default queue specified MSG=cannot locate queue)". When the
root user tried to run "qrun 1", I got an error message along the lines
of "not being able to find/locate the mentioned job".
Initially my torque server setup had only "set server managers
= root at jupiter.mynetwork.com". I was able to add a second entry, but
no others.
This is a setup on Rocks 5, and there is a command
/opt/torque/bin/hostn. I am not sure whether it originally comes
with Torque, but here is what I found:
$ hostname
Jupiter.mynetwork.com
[root at Jupiter server_logs]# cd /opt/torque/bin/
[root at Jupiter bin]# hostn
Usage: hostn [-v] hostname
-v turns on verbose output
[root at Jupiter bin]# hostn -v jupiter
primary name: Jupiter.local (from gethostbyname())
aliases: Jupiter
address length: 4 bytes
address: 10.1.1.1 (16843018 dec) name: Jupiter.local
[root at Jupiter bin]# hostn -v Jupiter
primary name: Jupiter.local (from gethostbyname())
aliases: Jupiter
address length: 4 bytes
address: 10.1.1.1 (16843018 dec) name: Jupiter.local
[root at Jupiter bin]# hostn -v Jupiter.mynetwork.com
primary name: Jupiter.mynetwork.com (from gethostbyname())
aliases: -none-
address length: 4 bytes
address: 192.168.0.181 (3036719296 dec) name: Jupiter.mynetwork.com
[root at Jupiter bin]# hostn -v jupiter.mynetwork.com
primary name: Jupiter.mynetwork.com (from gethostbyname())
aliases: -none-
address length: 4 bytes
address: 192.168.0.181 (3036719296 dec) name: Jupiter.mynetwork.com
[root at Jupiter bin]# hostn -v Jupiter.local
primary name: Jupiter.local (from gethostbyname())
aliases: Jupiter
address length: 4 bytes
address: 10.1.1.1 (16843018 dec) name: Jupiter.local
[root at Jupiter bin]# hostn -v jupiter.local
primary name: Jupiter.local (from gethostbyname())
aliases: Jupiter
address length: 4 bytes
address: 10.1.1.1 (16843018 dec) name: Jupiter.local
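For anyone without hostn, the same lookup can be approximated in Python with socket.gethostbyname_ex(). This is only a sketch: describe() and addr_as_decimal() are names invented here, and the output format merely imitates the transcript above. The "dec" value is computed as the four address bytes read as a little-endian integer, which is an inference from the two numbers shown (10.1.1.1 gives 16843018 and 192.168.0.181 gives 3036719296), not something taken from the hostn source.

```python
import socket
import struct


def addr_as_decimal(dotted):
    """Read the four network-order address bytes as a little-endian unsigned int."""
    return struct.unpack("<I", socket.inet_aton(dotted))[0]


def describe(name, resolver=socket.gethostbyname_ex):
    """Resolve name and return lines loosely modelled on the `hostn -v` output."""
    primary, aliases, addresses = resolver(name)
    lines = [
        "primary name: %s" % primary,
        "aliases: %s" % (" ".join(aliases) if aliases else "-none-"),
    ]
    for addr in addresses:
        lines.append("address: %s (%d dec)" % (addr, addr_as_decimal(addr)))
    return lines
```

Calling describe("jupiter") on the head node would show which primary name the resolver returns for the short name. The resolver argument exists only so the function can be tried with canned data.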
------------------------------------------------------------------------------
Here is a very simple torque config:
$ qmgr -c 'p s'
#
# Create queues and set their attributes.
#
#
# Create and define queue default
#
create queue default
set queue default queue_type = Execution
set queue default kill_delay = 90
set queue default enabled = True
set queue default started = True
#
# Set server attributes.
#
set server acl_hosts = jupiter
set server acl_hosts += Jupiter
set server acl_hosts += jupiter.mynetwork.com
set server acl_hosts += jupiter.local
set server acl_hosts += Jupiter.local
set server acl_hosts += Jupiter.mynetwork.com
set server managers = root at jupiter.mynetwork.com
set server managers += root at jupiter.local
set server log_events = 511
set server mail_from = adm
set server resources_default.walltime = 336:00:00
set server scheduler_iteration = 60
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 6
set server node_pack = False
set server next_job_number = 2
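DNS host names are case-insensitive, so several of the acl_hosts entries above name the same host. A small Python sketch that folds them by case is shown here; unique_hosts() is a hypothetical helper, and whether this Torque version itself compares acl_hosts entries case-insensitively is not established by anything in this message.

```python
def unique_hosts(qmgr_lines):
    """Collect acl_hosts values from `qmgr -c 'p s'` output, folding case."""
    seen = []
    for line in qmgr_lines:
        parts = line.split()
        # Expected shape: set server acl_hosts = NAME   or   set server acl_hosts += NAME
        if (len(parts) == 5
                and parts[:3] == ["set", "server", "acl_hosts"]
                and parts[3] in ("=", "+=")):
            host = parts[4].lower()
            if host not in seen:
                seen.append(host)
    return seen
```

Applied to the six acl_hosts lines above, this leaves three distinct names: jupiter, jupiter.mynetwork.com and jupiter.local.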
Here are the error messages in torque's log:
10/31/2008 22:59:23;0001;PBS_Server;Svr;PBS_Server;req_quejob, requested queue not found
10/31/2008 22:59:23;0080;PBS_Server;Req;req_reject;Reject reply code=15037(No default queue specified MSG=cannot locate queue), aux=0, type=QueueJob, from struong at jupiter.mynetwork.com
10/31/2008 23:03:27;0100;PBS_Server;Job;1.jupiter.mynetwork.com;enqueuing into default, state 1 hop 1
10/31/2008 23:03:27;0008;PBS_Server;Job;1.jupiter.mynetwork.com;Job Queued at request of testuser at jupiter.mynetwork.com, owner = testuser at jupiter.mynetwork.com, job name = PtPd_3.N.6ML.fcc2.or, queue = default
10/31/2008 23:03:52;0080;PBS_Server;Job;1.jupiter.local;Unknown Job Id
10/31/2008 23:03:52;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id MSG=cannot locate job), aux=0, type=RunJob, from root at Jupiter.local
10/31/2008 23:03:52;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=LocateJob, from root at Jupiter.local
10/31/2008 23:05:56;0080;PBS_Server;Job;1.jupiter.local;Unknown Job Id
10/31/2008 23:05:56;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id MSG=cannot locate job), aux=0, type=RunJob, from root at Jupiter.local
10/31/2008 23:05:56;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=LocateJob, from root at Jupiter.local
10/31/2008 23:06:05;0080;PBS_Server;Job;1.jupiter.local;Unknown Job Id
10/31/2008 23:06:05;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id MSG=cannot locate job), aux=0, type=RunJob, from root at Jupiter.local
10/31/2008 23:06:05;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=LocateJob, from root at Jupiter.local
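The mismatch in the log can be pulled out mechanically. The Python sketch below assumes each record sits on one line (wrapped lines rejoined) with six semicolon-separated fields, as in the excerpt above, and that the fifth field of a "Job" record is the job ID. The function names are illustrative only.

```python
def job_ids(log_lines):
    """Return the distinct job IDs from Job records, in order of appearance."""
    ids = []
    for line in log_lines:
        # Field layout inferred from the excerpt:
        # timestamp;event code;daemon;record type;object name;message
        fields = line.split(";", 5)
        if len(fields) == 6 and fields[3] == "Job" and fields[4] not in ids:
            ids.append(fields[4])
    return ids


def server_suffixes(ids):
    """Drop the leading sequence number, leaving the server-name part of each ID."""
    return sorted({job_id.split(".", 1)[1] for job_id in ids if "." in job_id})
```

On the records above this yields two suffixes, jupiter.local and jupiter.mynetwork.com: the job was queued as 1.jupiter.mynetwork.com, while the qrun request from root asked for 1.jupiter.local.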
Please tell me how to fix this.
Thank you very much.