[torqueusers] GT-4.0.7/Torque-2.2.1/MPICH-G2 problems
Z167344 at students.niu.edu
Tue Apr 7 10:15:34 MDT 2009
I was hoping to get some help running mpich-g2 apps on a cluster via WS-GRAM and Torque.
Torque 2.2.1 is installed as root on Gentoo with the USE flags 'crypt' and 'server' (I believe 'crypt' causes it to use ssh instead of rsh).
Globus 4.0.7 (source) and mpich 1.2.7p1 are installed as the 'globus' user.
Non-MPI jobs submitted via WS-GRAM/globusrun-ws using a multiJob-style RSL run as expected. As soon as I add <jobType>mpi</jobType> to the RSL, things go south. I should note that I have successfully run 'ring.c' by invoking mpirun manually (hence submitting a Pre-WS RSL to gsigatekeeper works as well).
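For reference, a multiJob description of the kind discussed here has roughly the shape below. This is only a sketch following the GT 4.0 WS-GRAM job description layout, not the actual ring.multi.xml: the host name head.example.org, the executable path, and the count are placeholders, and only one subjob is shown (the second would be another <job> element of the same form).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: host name, executable path and count are placeholders. -->
<multiJob xmlns:gram="http://www.globus.org/namespaces/2004/10/gram/job"
          xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
  <!-- The multijob itself is handed to the "Multi" factory resource. -->
  <factoryEndpoint>
    <wsa:Address>https://head.example.org:8443/wsrf/services/ManagedJobFactoryService</wsa:Address>
    <wsa:ReferenceProperties>
      <gram:ResourceID>Multi</gram:ResourceID>
    </wsa:ReferenceProperties>
  </factoryEndpoint>
  <!-- One subjob, submitted to Torque through the PBS scheduler adapter. -->
  <job>
    <factoryEndpoint>
      <wsa:Address>https://head.example.org:8443/wsrf/services/ManagedJobFactoryService</wsa:Address>
      <wsa:ReferenceProperties>
        <gram:ResourceID>PBS</gram:ResourceID>
      </wsa:ReferenceProperties>
    </factoryEndpoint>
    <executable>${GLOBUS_USER_HOME}/ring</executable>
    <count>1</count>
    <!-- The line whose addition triggers the failure described above. -->
    <jobType>mpi</jobType>
  </job>
</multiJob>
```

With the <jobType> element removed and the executable set to /bin/hostname, a file of this shape corresponds to the non-MPI case that runs as expected.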
# run /bin/hostname on each compute-node w/ PBS
globusrun-ws -submit -J -f hostname.multi.xml
# output; these are the 2 compute-nodes
# run mpich-g2 test app 'ring.c' with <jobType>mpi</jobType>
globusrun-ws -submit -J -f ring.multi.xml
Submission of subjob (label = "subjob 0") failed because the connection to the server failed (check host and port) (error code 62)
Submission of subjob (label = "subjob 1") failed because the connection to the server failed (check host and port) (error code 62)
The container and torque logs don't show any obvious red flags, and 'globusrun-ws -status' checks appear to show an errorless run.
Any help debugging this would be greatly appreciated. I can post logs if necessary.