[torqueusers] Job hangs on new Torque setup. Possible errors with hostname case sensitivity. (CentOS - Rocks 5)

Steven Truong midair77 at gmail.com
Sat Nov 1 00:36:52 MDT 2008


I learned my lesson the hard way while setting up my head node, where the
Torque server is running.  The FQDN is Jupiter.mynetwork.com, and the
following entries relate to the head node:

/etc/hosts
10.1.1.1        Jupiter.local Jupiter # originally frontend-0-0
192.168.0.181   Jupiter.mynetwork.com

/etc/sysconfig/network
....
HOSTNAME=Jupiter.mynetwork.com
-----------
A test user submitted a job and it was queued.  Torque's log contained
"(No default queue specified MSG=cannot locate queue)".  When root tried
to run "qrun 1", I got an error message along the lines of "not being
able to find/locate the mentioned job".

Initially, my Torque server setup had only "set server managers
= root at jupiter.mynetwork.com".  I was able to add the second entry but
no others.

This is a Rocks 5 setup.  There is a command at /opt/torque/bin/hostn;
I am not sure whether it originally ships with Torque, but here is what
I found:

$ hostname
Jupiter.mynetwork.com
[root at Jupiter server_logs]# cd /opt/torque/bin/
[root at Jupiter bin]# hostn
Usage: hostn [-v] hostname
         -v turns on verbose output
[root at Jupiter bin]# hostn -v jupiter
primary name:  Jupiter.local (from gethostbyname())
aliases:           Jupiter
     address length:  4 bytes
     address:             10.1.1.1   (16843018 dec)  name:  Jupiter.local
[root at Jupiter bin]# hostn -v Jupiter
primary name:  Jupiter.local (from gethostbyname())
aliases:           Jupiter
     address length:  4 bytes
     address:             10.1.1.1   (16843018 dec)  name:  Jupiter.local
[root at Jupiter bin]# hostn -v Jupiter.mynetwork.com
primary name:  Jupiter.mynetwork.com (from gethostbyname())
aliases:            -none-
     address length:  4 bytes
     address:        192.168.0.181   (3036719296 dec)  name:  Jupiter.mynetwork.com
[root at Jupiter bin]# hostn -v jupiter.mynetwork.com
primary name:  Jupiter.mynetwork.com (from gethostbyname())
aliases:            -none-
     address length:  4 bytes
     address:        192.168.0.181   (3036719296 dec)  name:  Jupiter.mynetwork.com
[root at Jupiter bin]# hostn -v Jupiter.local
primary name:  Jupiter.local (from gethostbyname())
aliases:           Jupiter
     address length:  4 bytes
     address:             10.1.1.1   (16843018 dec)  name:  Jupiter.local
[root at Jupiter bin]# hostn -v jupiter.local
primary name:  Jupiter.local (from gethostbyname())
aliases:           Jupiter
     address length:  4 bytes
     address:             10.1.1.1   (16843018 dec)  name:  Jupiter.local
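
To make the lookups above easier to compare, here is a minimal Python sketch that reproduces them from the two /etc/hosts entries quoted earlier. It is only an illustration: the helper names (parse_hosts, lookup, as_hostn_decimal) are made up for this example, the matching is a simplified stand-in for what gethostbyname() does, and reading the address bytes as a little-endian integer is an assumption that happens to match the "dec" values hostn printed.

```python
import socket
import struct

# The two head-node entries quoted from /etc/hosts in this post.
HOSTS_TEXT = """\
10.1.1.1        Jupiter.local Jupiter # originally frontend-0-0
192.168.0.181   Jupiter.mynetwork.com
"""


def parse_hosts(text):
    """Return a list of (address, primary_name, aliases) tuples."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        address, primary, *aliases = line.split()
        entries.append((address, primary, aliases))
    return entries


def lookup(name, entries):
    """Case-insensitive match against primary names and aliases."""
    wanted = name.lower()
    for address, primary, aliases in entries:
        if wanted in [n.lower() for n in (primary, *aliases)]:
            return address, primary, aliases
    return None


def as_hostn_decimal(address):
    """Read the four network-order address bytes as a little-endian integer.

    Assumption: this is how the "dec" column came about; it matches the
    values shown in the hostn output above.
    """
    return struct.unpack("<I", socket.inet_aton(address))[0]


entries = parse_hosts(HOSTS_TEXT)
for name in ("jupiter", "Jupiter.local", "jupiter.mynetwork.com"):
    address, primary, aliases = lookup(name, entries)
    print(name, "->", primary, address, as_hostn_decimal(address))
```

The point of the sketch is that the spelling of the query does not matter, but the short name and the .local name map to 10.1.1.1 (primary name Jupiter.local) while the FQDN maps to 192.168.0.181, so the same machine answers to two different primary names.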

------------------------------------------------------------------------------
Here is my very simple Torque config:

$ qmgr -c 'p s'

#
# Create queues and set their attributes.
#
#
# Create and define queue default
#
create queue default
set queue default queue_type = Execution
set queue default kill_delay = 90
set queue default enabled = True
set queue default started = True
#
# Set server attributes.
#
set server acl_hosts = jupiter
set server acl_hosts += Jupiter
set server acl_hosts += jupiter.mynetwork.com
set server acl_hosts += jupiter.local
set server acl_hosts += Jupiter.local
set server acl_hosts += Jupiter.mynetwork.com
set server managers = root at jupiter.mynetwork.com
set server managers += root at jupiter.local
set server log_events = 511
set server mail_from = adm
set server resources_default.walltime = 336:00:00
set server scheduler_iteration = 60
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 6
set server node_pack = False
set server next_job_number = 2

Here are the error messages in Torque's log:

10/31/2008 22:59:23;0001;PBS_Server;Svr;PBS_Server;req_quejob, requested queue not found
10/31/2008 22:59:23;0080;PBS_Server;Req;req_reject;Reject reply code=15037(No default queue specified MSG=cannot locate queue), aux=0, type=QueueJob, from struong at jupiter.mynetwork.com
10/31/2008 23:03:27;0100;PBS_Server;Job;1.jupiter.mynetwork.com;enqueuing into default, state 1 hop 1
10/31/2008 23:03:27;0008;PBS_Server;Job;1.jupiter.mynetwork.com;Job Queued at request of testuser at jupiter.mynetwork.com, owner = testuser at jupiter.mynetwork.com, job name = PtPd_3.N.6ML.fcc2.or, queue = default
10/31/2008 23:03:52;0080;PBS_Server;Job;1.jupiter.local;Unknown Job Id
10/31/2008 23:03:52;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id MSG=cannot locate job), aux=0, type=RunJob, from root at Jupiter.local
10/31/2008 23:03:52;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=LocateJob, from root at Jupiter.local
10/31/2008 23:05:56;0080;PBS_Server;Job;1.jupiter.local;Unknown Job Id
10/31/2008 23:05:56;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id MSG=cannot locate job), aux=0, type=RunJob, from root at Jupiter.local
10/31/2008 23:05:56;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=LocateJob, from root at Jupiter.local
10/31/2008 23:06:05;0080;PBS_Server;Job;1.jupiter.local;Unknown Job Id
10/31/2008 23:06:05;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id MSG=cannot locate job), aux=0, type=RunJob, from root at Jupiter.local
10/31/2008 23:06:05;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=LocateJob, from root at Jupiter.local
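
One thing the log makes visible is that the job was queued as 1.jupiter.mynetwork.com, while the qrun requests were looking for 1.jupiter.local. The following Python sketch pulls that out of a few of the entries above (with the wrapped lines rejoined). It is a rough illustration only: the field labels and function names are invented for this example, and it assumes each entry is six semicolon-separated fields, as the lines above appear to be.

```python
from collections import defaultdict

# Three entries copied from the log in this post (wrapped lines rejoined).
LOG_LINES = [
    "10/31/2008 23:03:27;0100;PBS_Server;Job;1.jupiter.mynetwork.com;"
    "enqueuing into default, state 1 hop 1",
    "10/31/2008 23:03:52;0080;PBS_Server;Job;1.jupiter.local;Unknown Job Id",
    "10/31/2008 23:06:05;0080;PBS_Server;Job;1.jupiter.local;Unknown Job Id",
]

# Illustrative labels for the six fields; not taken from Torque documentation.
FIELDS = ("timestamp", "event_code", "daemon",
          "object_type", "object_name", "message")


def parse_entry(line):
    """Split one log entry into a dict keyed by the labels in FIELDS."""
    return dict(zip(FIELDS, line.split(";", 5)))


def suffixes_by_sequence(lines):
    """Map each job sequence number to the server names it appears under."""
    seen = defaultdict(set)
    for line in lines:
        entry = parse_entry(line)
        if entry["object_type"] != "Job":
            continue
        sequence, _, server = entry["object_name"].partition(".")
        seen[sequence].add(server)
    return dict(seen)


for sequence, servers in suffixes_by_sequence(LOG_LINES).items():
    if len(servers) > 1:
        print(f"job {sequence} appears under several server names: "
              f"{sorted(servers)}")
```

For the entries above this reports job 1 under both jupiter.local and jupiter.mynetwork.com, which fits the "Unknown Job Id" rejections: the request from root at Jupiter.local names a job id that does not match the one the server recorded.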


Please tell me how to fix this.


Thank you very much.

