[torqueusers] Remote submit host: qsub -I fails

Bill Wichser bill at Princeton.EDU
Sun Jun 21 06:59:03 MDT 2009


Meaning what?  An nslookup on the hostname assigned to my internal 
network comes back with an FQDN from the nameserver on the external 
network.  I run no nameserver on the internal network.  My nodes have 
no /etc/resolv.conf and fail any kind of DNS lookup (nslookup).  They 
are only aware of the internal network.
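
For what it's worth, since the nodes have no resolver, any name they 
resolve has to come out of /etc/hosts.  A check that goes through the 
normal resolver order (unlike nslookup, which only queries DNS) would 
be something like this on a compute node:

   % getent hosts server
   % getent hosts login1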

In /etc/hosts I have just one entry for my server host, distributed 
across the entire cluster.  There are no aliases there.  Remember, qsub 
without the interactive flag works fine.
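
Concretely, that entry is a single line of roughly this form (address 
made up here for illustration):

   10.1.1.1    server

with nothing else on the line.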

Bill

Smith, Jerry Don II wrote:
> Does the internal hostname of your server resolve to the one you assigned in $PBS_HOME/server_name?
> 
> And is it the first assigned alias for that machine in /etc/hosts?
> 
> --Jerry
> 
> 
> ----- Original Message -----
> From: torqueusers-bounces at supercluster.org <torqueusers-bounces at supercluster.org>
> To: torqueusers at supercluster.org <torqueusers at supercluster.org>
> Sent: Sat Jun 20 19:36:34 2009
> Subject: [torqueusers] Remote submit host:  qsub -I fails
> 
> I have a host set up to run the torque server as well as the maui 
> scheduler (hostname: server).  I also have a login node set up to send 
> jobs to this torque server (hostname: login1).  My version of torque 
> is 2.3.6.
> 
> While I can submit jobs fine from this login host and use qstat and 
> showq, I cannot submit an interactive job.  Here is the output:
> 
> % qsub -I -l nodes=2:ppn=1,walltime=10:00
> qsub: waiting for job 39.server to start
> qsub: job 39.server apparently deleted
> 
> While I cannot run a tracejob on this login node, a tracejob on the 
> server shows:
> 
> 
> Job: 39.server
> 
> 06/20/2009 20:53:29  S    enqueuing into default, state 1 hop 1
> 06/20/2009 20:53:29  S    dequeuing from default, state QUEUED
> 06/20/2009 20:53:29  S    enqueuing into short, state 1 hop 1
> 06/20/2009 20:53:29  S    Job Queued at request of bill@login1, owner =
>                            bill@login1, job name = STDIN, queue = short
> 06/20/2009 20:53:29  A    queue=default
> 06/20/2009 20:53:29  A    queue=short
> 06/20/2009 20:53:30  S    Job Modified at request of root@server
> 06/20/2009 20:53:30  S    Job Run at request of root@server
> 06/20/2009 20:53:30  S    Job Modified at request of root@server
> 06/20/2009 20:53:30  S    Exit_status=-1 resources_used.cput=00:00:00
>                            resources_used.mem=0kb resources_used.vmem=0kb
>                            resources_used.walltime=00:00:00
>                            Error_Path=/dev/pts/0
>                            Output_Path=/dev/pts/0
> 
> Note the Exit_status=-1, which in one discussion on this list was 
> attributed to an /etc/resolv.conf issue.
> 
> Checking /var/log/messages on a node, I find pbs_mom spitting out 
> info about my multi-homed setup:
> 
> Jun 20 21:27:55 r6c2n1 pbs_mom: No route to host (113) in 
> TMomFinalizeChild, cannot open interactive qsub socket to host 
> login1:52427 - 'cannot bind to port 1023 in client_to_svr - connection 
> refused' - check routing tables/multi-homed host issues
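> 
> The mom there is trying to connect back to the port qsub is listening 
> on (52427 above; the port changes per job), so while an interactive 
> job is sitting in "waiting for job ... to start", something like
> 
>    % getent hosts login1
>    % telnet login1 52427
> 
> from a node should show whether it resolves login1 to the internal 
> address and can actually reach it.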
> 
> Both my server and login nodes are multi-homed.  Every host has its 
> local addresses in /etc/hosts.  On my server I've added a line to 
> /var/spool/PBS/torque.cfg:
> SERVERHOST  server
> believing that a hostname string is needed here rather than an actual 
> IP.  Regardless, the interactive session is trying to connect back to 
> a remote submit host which is also multi-homed.
> 
> Before I go down the path of assigning a different hostname for the 
> local network (say, login1-sn23), does anyone have experience with 
> this type of setup?  Am I on the right path here?
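> 
> The idea, roughly, would be an extra name in every node's /etc/hosts 
> that maps only to login1's internal address (address made up here for 
> illustration):
> 
>    10.1.1.2    login1-sn23
> 
> so that whatever the mom resolves for the submit host points at the 
> internal network.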
> 
> Thanks,
> Bill
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
> 

