[torqueusers] Jobs staying in "Running" state after completion

Adam Emerich aemerich at us.ibm.com
Mon Feb 25 06:45:29 MST 2008


We recently made some major changes to the network structure of our
cluster.  These changes included moving all the Torque/Moab traffic onto a
hybrid of InfiniBand and GigE interfaces such that:

Torque Server <- GigE -> IONode (InfiniBand Gateway) <- InfiniBand ->
Compute Node
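
In routing terms that path boils down to one static route on each end,
something like the following (the interface name, the 10.10.x.x compute
network, and the assumption that 10.10.4.201 is the IONode's IB-facing
address are illustrative, not copied from our actual config):

# On each compute node: reach the server's 10.11.x.x network via the
# IONode over IB (assumes 10.10.4.201 is the IONode's IB address)
route add -net 10.11.0.0 netmask 255.255.0.0 gw 10.10.4.201

# On the management node: send compute-node traffic out the GigE
# interface toward the IONode (device name is illustrative)
route add -net 10.10.0.0 netmask 255.255.0.0 dev eth1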

To do this we created a new network on the management node where the Torque
server lives.  This new network is 10.11.x.x.  We then created a route that
directs traffic destined for the compute nodes onto this network and out
the GigE interface.  On the compute nodes we had to create a new hostname
for the "server_name" (mnc-ib0).  We also added a route for the 10.11.x.x
network to go out over IB to the IONode.  To get Torque working we added
the following to the mom_priv/config file:

$clienthost 10.10.4.201
$restricted 10.10.4.201
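
For completeness, the compute-node side of the naming setup is just an
/etc/hosts entry plus the server_name file, roughly like this (the
10.11.x.x address shown is illustrative, and /var/spool/torque is the
default TORQUE_HOME, so adjust if yours differs):

# /etc/hosts on each compute node: mnc-ib0 is the Torque server's
# address on the new 10.11.x.x network
10.11.4.201   mnc-ib0

# Point the MOM at that name
echo mnc-ib0 > /var/spool/torque/server_name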

These allow traffic to come from the IONode.  This all seemed to be working
well, as we can schedule and run jobs.  However, if I run an interactive
job and then "exit" from it, the job never gets cleaned up in the Torque
queue, which prevents another job from being scheduled on that node.  A
simple qdel of the job does not clean it up; it requires a "qdel -p".
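
If it helps, the sort of check we can run from the server to see whether a
MOM can still talk back to pbs_server is momctl's diagnostic mode (the
node name here is just an example):

# Query MOM state from the server; the server/communication lines in
# the output show whether the MOM thinks it can reach pbs_server
momctl -h compute-001 -d 3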

Is there anything we are missing in this setup?  This pairing of Torque and
Moab worked fine before the network changes.

Kind Regards

Adam Emerich


