[torqueusers] GT-4.0.7/Torque-2.2.1/MPICH-G2 problems

Adam Scovel Z167344 at students.niu.edu
Tue Apr 7 10:15:34 MDT 2009


Hi all,

I was hoping to get some help running mpich-g2 apps on a cluster via WS-GRAM and Torque.

Synopsis:
Torque 2.2.1, installed as root on Gentoo with the USE flags 'crypt' and 'server' (I believe 'crypt' makes it use ssh instead of rsh).

Globus 4.0.7 (built from source) and MPICH 1.2.7p1, both installed as the 'globus' user.

Non-MPI jobs submitted via WS-GRAM/globusrun-ws using a multiJob-style RSL run as expected.  As soon as I add <jobType>mpi</jobType> to the RSL, the subjob submissions fail (see example 2 below).  I should note that I have used mpirun manually to run 'ring.c' successfully (hence submitting a Pre-WS RSL to gsigatekeeper works as well).
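
For readers unfamiliar with the format, a multiJob description of this kind generally has the shape sketched below. This is an illustration only, not the actual ring.multi.xml: HEADNODE, /path/to/ring, and the count values are placeholders, and the namespace URIs are the ones I understand GT 4.0.x WS-GRAM to use.

```xml
<!-- Sketch of a WS-GRAM multiJob description with two MPI subjobs.
     HEADNODE, /path/to/ring, and the counts are placeholders. -->
<multiJob xmlns:gram="http://www.globus.org/namespaces/2004/10/gram/job"
          xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
    <!-- The multiJob itself is handled by the "Multi" factory resource. -->
    <factoryEndpoint>
        <wsa:Address>https://HEADNODE:8443/wsrf/services/ManagedJobFactoryService</wsa:Address>
        <wsa:ReferenceProperties>
            <gram:ResourceID>Multi</gram:ResourceID>
        </wsa:ReferenceProperties>
    </factoryEndpoint>
    <!-- Subjob 0: handed to the PBS (Torque) scheduler adapter. -->
    <job>
        <factoryEndpoint>
            <wsa:Address>https://HEADNODE:8443/wsrf/services/ManagedJobFactoryService</wsa:Address>
            <wsa:ReferenceProperties>
                <gram:ResourceID>PBS</gram:ResourceID>
            </wsa:ReferenceProperties>
        </factoryEndpoint>
        <executable>/path/to/ring</executable>
        <count>1</count>
        <!-- This is the element whose addition triggers the failure. -->
        <jobType>mpi</jobType>
    </job>
    <!-- Subjob 1: same shape as subjob 0. -->
    <job>
        <factoryEndpoint>
            <wsa:Address>https://HEADNODE:8443/wsrf/services/ManagedJobFactoryService</wsa:Address>
            <wsa:ReferenceProperties>
                <gram:ResourceID>PBS</gram:ResourceID>
            </wsa:ReferenceProperties>
        </factoryEndpoint>
        <executable>/path/to/ring</executable>
        <count>1</count>
        <jobType>mpi</jobType>
    </job>
</multiJob>
```

In the working non-MPI case the subjobs would look the same, minus the <jobType>mpi</jobType> element and with /bin/hostname as the executable.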

Examples:
# 1)
# run /bin/hostname on each compute node with the PBS scheduler
globusrun-ws -submit -J -f hostname.multi.xml

# output; these are the 2 compute-nodes
tp-x002.ci.uchicago.edu
tp-x003.ci.uchicago.edu

# 2)
# run mpich-g2 test app 'ring.c' with <jobType>mpi</jobType>
globusrun-ws -submit -J -f ring.multi.xml

# output
    Submission of subjob (label = "subjob 0") failed because the connection to the server failed (check host and port) (error code 62)
    Submission of subjob (label = "subjob 1") failed because the connection to the server failed (check host and port) (error code 62)

The container and Torque logs don't show any obvious red flags, and 'globusrun-ws -status' checks appear to report a run without errors.

Any help debugging this would be greatly appreciated.  I can post logs if necessary.

-Adam
