[torqueusers] Remote submit host: qsub -I fails

Bill Wichser bill at Princeton.EDU
Mon Jun 22 06:46:30 MDT 2009


After much experimentation, this turns out to be a simple permissions
issue that I overlooked in my panic to make things work over the
weekend for the new cluster.  The client machines were not allowed to
connect back to the login node; opening up the entire network in
iptables on the login node lets them connect.  The message about
routing/multi-homed hosts was a red herring here.
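
For anyone hitting the same error, here is a rough sketch of the kind of
rule involved.  The subnet and interface below are placeholders, not the
values from my cluster, and the rule is only printed so it can be
reviewed before root applies it.  Note that pbs_mom connects back to
whatever port qsub is listening on (login1:52427 in the log below),
which does not look like a fixed port, so I am not naming one here.

```shell
# Placeholder values: substitute the cluster's private subnet and the
# login node's internal interface.
CLUSTER_NET="10.0.0.0/16"
CLUSTER_IF="eth1"

# Accept TCP connections arriving from the compute nodes on the
# internal interface of the login node.
RULE="INPUT -i $CLUSTER_IF -s $CLUSTER_NET -p tcp -j ACCEPT"

# Dry run: print the command for review instead of changing the firewall.
echo "iptables -I $RULE"
```

Once the printed line looks right, running it as root on the login node
should let the interactive session connect.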

Sorry for the unneeded bandwidth.

Bill

Bill Wichser wrote:
> I have a host set up to run the torque server as well as maui scheduler 
> (server).  I also have a login node set up to send jobs to this torque 
> server (login1).  My version of torque is 2.3.6.
> 
> While I can submit jobs fine from this login host, use qstat and showq, 
> I cannot submit an interactive job.  Here is the output:
> 
> % qsub -I -l nodes=2:ppn=1,walltime=10:00
> qsub: waiting for job 39.server to start
> qsub: job 39.server apparently deleted
> 
> While I cannot run a tracejob on this login node, a tracejob on the 
> server shows:
> 
> 
> Job: 39.server
> 
> 06/20/2009 20:53:29  S    enqueuing into default, state 1 hop 1
> 06/20/2009 20:53:29  S    dequeuing from default, state QUEUED
> 06/20/2009 20:53:29  S    enqueuing into short, state 1 hop 1
> 06/20/2009 20:53:29  S    Job Queued at request of bill@login1, owner =
>                            bill@login1, job name = STDIN, queue = short
> 06/20/2009 20:53:29  A    queue=default
> 06/20/2009 20:53:29  A    queue=short
> 06/20/2009 20:53:30  S    Job Modified at request of root@server
> 06/20/2009 20:53:30  S    Job Run at request of root@server
> 06/20/2009 20:53:30  S    Job Modified at request of root@server
> 06/20/2009 20:53:30  S    Exit_status=-1 resources_used.cput=00:00:00
>                            resources_used.mem=0kb resources_used.vmem=0kb
>                            resources_used.walltime=00:00:00
>                            Error_Path=/dev/pts/0
>                            Output_Path=/dev/pts/0
> 
> Note the Exit_status=-1, which one discussion on this list attributed
> to an /etc/resolv.conf issue.
> 
> Checking /var/log/messages on a node, I find pbs_mom spitting out info 
> about my multihomed host:
> 
> Jun 20 21:27:55 r6c2n1 pbs_mom: No route to host (113) in 
> TMomFinalizeChild, cannot open interactive qsub socket to host 
> login1:52427 - 'cannot bind to port 1023 in client_to_svr - connection 
> refused' - check routing tables/multi-homed host issues
> 
> Both my server and login nodes are multi-homed.  Everyone has local 
> addresses in /etc/hosts.  I've added to /var/spool/PBS/torque.cfg a line:
> SERVERHOST  server
> on my server, believing that a string is needed here rather than an 
> actual IP.  Regardless, the interactive session is trying to get back to 
> a remote submit host which is also multihomed.
> 
> Before I head down the path of assigning a different hostname for the
> local network (login1-sn23, say), does anyone have any experience with
> this type of setup?  Am I on the right path here?
> 
> Thanks,
> Bill
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

