[torqueusers] hanging jobs/communication error with mixed network environments

David Backeberg david.backeberg at case.edu
Thu Jul 5 10:31:18 MDT 2007


Good diagnosis. It sounds like you should examine how the nodes reach
each other.
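
A quick first check is to look up, on each node, which address the
other node's name resolves to and which route is used to reach it. A
minimal sketch (the 192.168.1.x address below is just a placeholder
for whatever racl00 actually resolves to):

    # on racl02
    getent hosts racl00          # which address does the name resolve to?
    ip route get 192.168.1.10    # which interface/route reaches it?

If getent returns racl00's public address, the mom on racl02 will try
to reach it over the default route, which matches the tcpdump output
you describe.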

These machine names should be set in /etc/hosts, and the hosts: line
in /etc/nsswitch.conf should be configured to consult files before
dns.
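
For example, something like this on every node (the addresses below
are made-up placeholders; substitute your real internal ones):

    # /etc/hosts
    192.168.1.10    racl00
    192.168.1.11    racl01
    192.168.1.12    racl02

    # /etc/nsswitch.conf
    hosts: files dns

With files first, the short names resolve to the internal addresses no
matter what public DNS says about racl00.inf-ra.uni-jena.de.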

You've probably set your Torque server to listen only on the internal
address, and that is the address racl02 should use to talk to racl00.
Also make sure the machine names in your Torque server's node list
match the names you give in /etc/hosts. For your own sanity in
stopping this problem, give the machines internal names that are
distinct from the names reachable via the public interfaces. If you
don't know how to do this, read up on /etc/hosts.
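
As a sketch, assuming the default Torque spool directory of
/var/spool/torque (yours may differ, and the np counts here are
placeholders):

    # /var/spool/torque/server_priv/nodes on racl00
    racl00 np=1
    racl01 np=1
    racl02 np=1

    # /var/spool/torque/server_name on every node
    racl00

Restart pbs_server and the pbs_moms after changing these so the names
get re-resolved against the updated /etc/hosts.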

-Dave

On 7/4/07, Adrian Knoth <adi at drcomp.erfurt.thur.de> wrote:
> Hi!
>
> I'm completely new to torque and facing a weird problem:
>
> We have three nodes, racl00, racl01 and racl02, where racl00 is running
> the torque server (2.1.8). racl00 and racl01 have both public and
> private IPv4 addresses; racl02 only has private addresses.
>
> Running single-node jobs works fine, they're correctly distributed to
> all three nodes.
>
> When I try to run a job across two nodes (qsub -l nodes=2), and racl00
> and racl02 are allocated, job execution hangs:
>
> From momctl -d 3 on racl02:
> job[93.racl00.inf-ra.uni-jena.de]  state=PRERUN  sidlist=
>
> From momctl -d 3 on racl00:
> NOTE:  no local jobs detected
>
> tcpdump shows UDP communication between racl00's public and racl02's private
> address. If I remove racl02's default route, thus making racl00's public
> address unreachable, I'm getting
>
>    pbs_mom;sister could not communicate
>
> on racl02. In addition, there is also a lot of "Premature end of
> message" and ABORT output in racl02's mom logfile.
>
>
> Spawning jobs across racl01<->racl02 or racl00<->racl01 works fine.
> racl02's default gateway is racl01. Nevertheless, all three nodes share
> the same private networks (192.168.1.0/24 and 192.168.3.0/24) on two
> additional internal Ethernet segments.
>
>
> Limiting all communication to 192.168.1.0/24 would probably solve the
> issue. How can this be achieved? Do you see other solutions?
>
>
>
> TIA
>
>
> PS: All nodes also have public IPv6 connectivity, but at least the
> public torque 2.1.8 source code does not support IPv6 at all. Is there
> already some kind of internal development towards IPv6 support, and if
> not, would it be possible to contribute?
>
>
> --
> mail: adi at thur.de       http://adi.thur.de      PGP/GPG: key via keyserver
>
> Do you bring the solution, or are you yourself the problem?
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

