[torqueusers] Job does not run - Torque 2.5.11

Gus Correa gus at ldeo.columbia.edu
Tue Jun 19 15:16:11 MDT 2012


On 06/19/2012 04:38 PM, John Keller wrote:
> Hi all,
> Problem solved: I noticed on the mom_log that no job activity was
> recorded when the jobs were submitted on the server. So something was
> blocking this traffic. So, I disabled the FIREWALL on the compute
> node, and now jobs are running normally.
> These systems are all behind a university firewall, so it's doubtful
> that this is necessary - or, I could specify a port or ports to allow
> the PBS traffic.
> Thank you for your suggestions.
> John Keller
>

Hi John

A common setup in a cluster is to have a head node with
one IP/interface/port in a public network [or in the organization's
private network], and another one in a private [i.e. private to the
cluster] subnet.  The compute nodes also have their IP/interface/port
in this cluster-private subnet [say,
192.168.1.0/255.255.255.0], but normally they don't have a public
IP [or an organization/university-wide one].
Typically a switch [or a bunch of switches]
implements this cluster-private subnet.
The head node can provide NAT from the compute nodes to its
public side, if this is useful
[say, to allow the nodes to reach a license server, etc.].
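
If you go that route, a minimal NAT sketch on the head node could
look like the lines below [assuming, just as placeholders, that
eth0 is the public interface and eth1 the cluster-private one, on
the 192.168.1.0/24 subnet above]:

   # allow the head node to forward packets between its interfaces
   echo 1 > /proc/sys/net/ipv4/ip_forward

   # masquerade traffic from the cluster-private subnet behind the
   # head node's public address
   iptables -t nat -A POSTROUTING -s 192.168.1.0/24 -o eth0 -j MASQUERADE

   # let forwarded traffic out, and the replies back in
   iptables -A FORWARD -i eth1 -o eth0 -j ACCEPT
   iptables -A FORWARD -i eth0 -o eth1 \
       -m state --state ESTABLISHED,RELATED -j ACCEPT

You would also want to make the ip_forward setting persistent
[e.g. via /etc/sysctl.conf] if you keep this.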

Firewalls in general cause more headaches than joy
on the compute nodes.  The problem you saw with the Torque moms
will probably happen with MPI, NFS, etc., and you might end up with a
complex set of firewall rules to open various specific ports
for a number of different services.
You can keep a firewall on the head node, if you want,
in addition to the university/organization firewall.
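
If you prefer to keep the compute node firewalls and just open the
PBS traffic, something along these lines might do on openSUSE with
SuSEfirewall2 [a sketch, assuming Torque's default ports, 15001/tcp
for pbs_server and 15002-15003 tcp/udp for pbs_mom; do check the
ports your installation actually uses]:

   # on each compute node, in /etc/sysconfig/SuSEfirewall2,
   # open the default pbs_mom ports so the server can reach the mom
   FW_SERVICES_EXT_TCP="15002 15003"
   FW_SERVICES_EXT_UDP="15002 15003"

   # then reload the firewall
   rcSuSEfirewall2 restart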

However, if your nodes are piggy-backing on the university/organization
network, and there is no cluster-private subnet, then you may
need to check whether there are university-wide policies requiring
firewalls before you remove them.
If firewalls are required, things may become more complicated.
Otherwise, you could also try the scheme above.
In any case, [beowulf] clusters work in a much simpler way
if there is only one login/entry point:
the head node [or a few login nodes],
with the compute nodes hidden in a cluster-private
subnet.

Anyway, these are guesses.
I don't even know if you are trying to set up a beowulf
cluster or something else.
Everything may depend on how your
Torque server [head node?] and your mom sisters [compute nodes]
are connected to each other, and how flexible this scheme is
to adjust to fit your goals.
It also depends on what you are trying to achieve
[e.g. running parallel jobs, or running only serial jobs on a farm
of nodes, allowing or not allowing direct login to the compute nodes
via public/university IP, or a single login point via the head node,
etc.].
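
By the way, a quick way to check the server<->mom connection is
from the Torque server [using the compute node name from your
earlier message]:

   # list the nodes and their state [free, down, etc.]
   # as seen by pbs_server
   pbsnodes -a

   # query the mom's own diagnostics directly
   momctl -d 3 -h chemlinux2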

I hope this helps,
Gus Correa


>
> On Sun, Jun 17, 2012 at 5:19 PM, John Keller <jwkeller at alaska.edu> wrote:
>> Hi all,
>> OK, here's a newbie question:
>> I installed Torque 2.5.11 on the server chemlinux5, and installed
>> the mom on another openSUSE 12.1 system, chemlinux2. I just have the
>> default batch settings. When a user attempts the test job
>> echo "sleep 30" | qsub , the job appears on the qstat list as "R"
>> status, and nothing happens. I assume that "sleep 30" should be
>> echoed on the terminal (although the manual does not say what should
>> happen). Attached is a screenshot of the current status.
>> Thanks for any suggestions,
>> John Keller
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


