[torqueusers] hanging jobs/communication error with mixed network environments

Adrian Knoth adi at drcomp.erfurt.thur.de
Wed Jul 4 09:01:11 MDT 2007


I'm completely new to torque and am facing a weird problem:

We have three nodes, racl00, racl01 and racl02, where racl00 is running
the torque server (2.1.8). racl00 and racl01 have both public and
private IPv4 addresses; racl02 has only private addresses.

Running single-node jobs works fine; they're correctly distributed to
all three nodes.

When I try to run a job across two nodes (qsub -l nodes=2), and racl00
and racl02 are allocated, job execution hangs:

From momctl -d 3 on racl02:
job[93.racl00.inf-ra.uni-jena.de]  state=PRERUN  sidlist=

From momctl -d 3 on racl00:
NOTE:  no local jobs detected

tcpdump shows UDP communication between racl00's public and racl02's private
address. If I remove racl02's default route, thus making racl00's public
address unreachable, I'm getting

   pbs_mom;sister could not communicate

on racl02. In addition, there is also a lot of "Premature end of
message" and ABORT output in racl02's mom logfile.
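
To cross-check the tcpdump observation, here is a minimal sketch of how
one could ask the kernel which local IPv4 address it would use as the
source when talking to another node. The function name and the port
number are illustrative assumptions and not part of torque; the port is
arbitrary because connect() on a UDP socket sends no packet.

```python
import socket


def local_source_address(dest_host, dest_port=9):
    """Return the local IPv4 address the kernel would pick to reach dest_host.

    dest_port is an arbitrary placeholder: connecting a UDP socket only
    performs a route lookup and binds a local address, nothing is sent.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect((dest_host, dest_port))
        # getsockname() returns (address, port) of the local end
        return s.getsockname()[0]
```

Calling local_source_address("racl02") on racl00 should show whether the
public or the private address is chosen for traffic towards racl02,
provided the name resolves the same way it does for pbs_mom.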

Spawning jobs across racl01<->racl02 or racl00<->racl01 works fine.
racl02's default gateway is racl01. Nevertheless, all three nodes share
the same private network ( and on two
additional internal Ethernet segments.

Limiting all communication to would probably solve the
issue. How can this be achieved? Do you see other solutions?


PS: All nodes also have public IPv6 connectivity, but at least the
public torque 2.1.8 source code does not support IPv6 at all. Is there
already some kind of internal development towards IPv6 support, and if
not, would it be possible to contribute?

mail: adi at thur.de  	http://adi.thur.de	PGP/GPG: key via keyserver

