[torqueusers] Remote submit host: qsub -I fails

Smith, Jerry Don II jdsmit at sandia.gov
Sat Jun 20 22:34:22 MDT 2009


Does the internal hostname of your server resolve to the one you assigned in $PBS_HOME/server_name?

And is it the first assigned alias for that machine in /etc/hosts?
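
One way to check the second question is sketched below. It parses /etc/hosts-style text and reports, for every line that mentions a given hostname, whether that hostname is the first name on the line. The function names and the sample addresses and aliases are made up for illustration; to check a real machine, pass in the contents of its own /etc/hosts.

```python
def parse_hosts(text):
    """Return a list of (address, [names...]) entries from /etc/hosts-style text."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        if len(fields) >= 2:
            entries.append((fields[0], fields[1:]))
    return entries


def is_first_name(text, hostname):
    """For every line that lists hostname, report whether it is the first name.

    Returns a list of (address, bool) pairs in file order.
    """
    return [(addr, names[0] == hostname)
            for addr, names in parse_hosts(text)
            if hostname in names]


# Hypothetical file contents; these addresses and aliases are assumptions.
SAMPLE = """\
127.0.0.1    localhost
10.0.23.1    server-sn23 server    # 'server' is only the second name here
192.168.1.1  server server-ext
"""

if __name__ == "__main__":
    for addr, first in is_first_name(SAMPLE, "server"):
        print(addr, "first name" if first else "listed only as an alias")
```

To run it against a live host, replace SAMPLE with open("/etc/hosts").read() and pass the name the server is configured with.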

--Jerry


----- Original Message -----
From: torqueusers-bounces at supercluster.org <torqueusers-bounces at supercluster.org>
To: torqueusers at supercluster.org <torqueusers at supercluster.org>
Sent: Sat Jun 20 19:36:34 2009
Subject: [torqueusers] Remote submit host:  qsub -I fails

I have a host (server) set up to run the torque server as well as the 
maui scheduler.  I also have a login node (login1) set up to submit jobs 
to this torque server.  My version of torque is 2.3.6.

While I can submit jobs and use qstat and showq from this login host 
without problems, I cannot submit an interactive job.  Here is the output:

% qsub -I -l nodes=2:ppn=1,walltime=10:00
qsub: waiting for job 39.server to start
qsub: job 39.server apparently deleted

While I cannot run a tracejob on this login node, a tracejob on the 
server shows:


Job: 39.server

06/20/2009 20:53:29  S    enqueuing into default, state 1 hop 1
06/20/2009 20:53:29  S    dequeuing from default, state QUEUED
06/20/2009 20:53:29  S    enqueuing into short, state 1 hop 1
06/20/2009 20:53:29  S    Job Queued at request of bill at login1, owner =
                           bill at login1, job name = STDIN, queue = short
06/20/2009 20:53:29  A    queue=default
06/20/2009 20:53:29  A    queue=short
06/20/2009 20:53:30  S    Job Modified at request of root at server
06/20/2009 20:53:30  S    Job Run at request of root at server
06/20/2009 20:53:30  S    Job Modified at request of root at server
06/20/2009 20:53:30  S    Exit_status=-1 resources_used.cput=00:00:00
                           resources_used.mem=0kb resources_used.vmem=0kb
                           resources_used.walltime=00:00:00
                           Error_Path=/dev/pts/0
                           Output_Path=/dev/pts/0

Note the Exit_status=-1, which one discussion on this list attributed 
to an /etc/resolv.conf issue.

Checking /var/log/messages on a node, I find pbs_mom spitting out info 
about my multihomed host:

Jun 20 21:27:55 r6c2n1 pbs_mom: No route to host (113) in 
TMomFinalizeChild, cannot open interactive qsub socket to host 
login1:52427 - 'cannot bind to port 1023 in client_to_svr - connection 
refused' - check routing tables/multi-homed host issues
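
The log line above suggests the compute node cannot reach the address it gets for login1. A minimal sketch for checking this from a node follows; it shows which IPv4 addresses a name resolves to locally and whether a plain TCP connection to a given port succeeds. The helper names resolve and can_connect are illustrative and not part of torque. The sketch also does not reproduce pbs_mom's use of a privileged source port, and the port that qsub -I listens on changes with every job, so a test listener on the login node is assumed.

```python
import socket


def resolve(host):
    """Return the sorted IPv4 addresses that host resolves to on this machine."""
    infos = socket.getaddrinfo(host, None, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def can_connect(host, port, timeout=3.0):
    """Try a plain TCP connection; return (True, None) or (False, error text)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        # Covers "No route to host", "Connection refused", timeouts, and
        # name resolution failures, all of which are OSError subclasses.
        return False, str(exc)
```

Running resolve("login1") on a compute node shows whether the name maps to the cluster-internal address or to the external one, and can_connect("login1", port) against a listener started on login1 shows whether that address is reachable.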

Both my server and login nodes are multi-homed.  Every host has its 
local addresses in /etc/hosts.  On my server, I've added this line to 
/var/spool/PBS/torque.cfg:
SERVERHOST  server
believing that a hostname is needed here rather than an actual IP 
address.  Regardless, the interactive session is trying to get back to 
a remote submit host which is also multi-homed.

Before I tread down the path of assigning a different hostname for the 
local network (login1-sn23 say), does anyone have any experience with 
this type of setup?  Am I on the right path here?

Thanks,
Bill