[torqueusers] MPI job not able to run after upgraded to 4.1.5.1

Steven Lo slo at cacr.caltech.edu
Tue Oct 15 17:17:23 MDT 2013


We have two other clusters running the same version (4.1.5.1) and they both
work fine, so it would be good to understand why this cluster is having
problems. This is a much older cluster than the other two, and we are not
sure whether the version of MPI has anything to do with it.

It would help a lot if we could isolate where the problem is (Torque, Maui
or MPI). From the strace output, can you help identify where the problem
occurs?
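
One test we could run to isolate Torque from MPI (assuming pbsdsh was built
and installed with this Torque version) is to launch a trivial command through
Torque's TM interface, with no MPI involved:

    # 2-node interactive job; pbsdsh uses the same MOM-to-MOM task-launch
    # path that a TM-enabled mpiexec would use
    qsub -I -l nodes=2
    pbsdsh hostname

If pbsdsh also hangs or reports a launch failure on the sister node, that
would point at the Torque/Maui layer rather than at MPI.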

Thanks.

Steven.


On 10/15/2013 04:07 PM, Ken Nielson wrote:
> I ran the example against 4.2-dev and it works.
>
> The 4.1.x branch is going to be deprecated soon. Is there a reason you 
> don't try 4.2.5.h3?
>
> Ken
>
>
> On Tue, Oct 15, 2013 at 4:04 PM, Steven Lo <slo at cacr.caltech.edu> wrote:
>
>
>     Hi,
>
>     We just upgraded our Torque server to 4.1.5.1 and we are having
>     trouble running a simple MPI program with just 2 nodes:
>
>     /* C Example of hello_world with 4 basic mpi calls */
>     #include <stdio.h>
>     #include <mpi.h>
>
>
>     int main (int argc, char *argv[])
>     {
>       int rank, size;
>
>       MPI_Init (&argc, &argv);                  /* starts MPI */
>       MPI_Comm_rank (MPI_COMM_WORLD, &rank);    /* get current process id */
>       MPI_Comm_size (MPI_COMM_WORLD, &size);    /* get number of processes */
>       printf( "Hello world from process %d of %d\n", rank, size );
>       MPI_Finalize();
>       return 0;
>     }
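>
>     (For anyone trying to reproduce this: a minimal submission looks
>     roughly like the following; the mpicc/mpirun wrapper names are
>     illustrative and depend on the MPI installation.)
>
>     #!/bin/sh
>     #PBS -l nodes=2:ppn=1
>     #PBS -l walltime=00:30:00
>     cd $PBS_O_WORKDIR
>     # hello_world built beforehand with the MPI wrapper, e.g. mpicc hello.c -o hello_world
>     mpirun -np 2 ./hello_world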
>
>
>     From the strace output (attached), it looks like the pbs_mom
>     (on 172.18.1.188) is not able to communicate properly with the
>     sister node (172.18.1.172):
>
>         shc172 - daemon did not report back when launched
>
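>     If it helps, we can also pull diagnostics from the sister MOM
>     (assuming momctl is installed and the default spool path is used on
>     this cluster):
>
>     # from the mother-superior / server node
>     momctl -d 3 -h shc172
>     # on shc172 itself; MOM logs are typically one file per day
>     tail -50 /var/spool/torque/mom_logs/$(date +%Y%m%d)
>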
>
>
>     The 'qstat -n' shows the job is queued properly:
>
>     shc188: qstat -n
>
>     maitre:
>                                                                              Req'd    Req'd       Elap
>     Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory    Time    S   Time
>     -------------------- ----------- -------- ---------------- ------ ----- ------ -------- -------- - --------
>     444347.maitre        sharon      weekdayQ STDIN                 0     2      2    --     00:30:00 R    --
>        shc188/0+shc172/0
>
>
>
>     The section of the strace which puzzles us is the following:
>
>     socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 12
>     setsockopt(12, SOL_SOCKET, SO_LINGER, {onoff=1, linger=5}, 8) = 0
>     connect(12, {sa_family=AF_INET, sin_port=htons(15003), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
>     mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b3a50008000
>     write(12, "+2+12+13444344.maitre2+32F8908E4"..., 66) = 66
>     poll([{fd=12, events=POLLIN|POLLHUP}], 1, 2147483647) = 1 ([{fd=12, revents=POLLIN}])
>     fcntl(12, F_GETFL)                      = 0x2 (flags O_RDWR)
>     read(12, "", 262144)                    = 0
>     close(12)                               = 0
>     poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}], 5, 1000) = 0 (Timeout)
>     sched_yield()                           = 0
>     poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}], 5, 1000) = 0 (Timeout)
>     sched_yield()                           = 0
>         .
>         .
>         .
>
>
>
>     I don't think the firewall is an issue since we are allowing all
>     packets from/to nodes in the private network.
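>
>     (As a double-check, assuming nc is available on the nodes, the MOM
>     ports can be probed directly in both directions; 15002/15003 are the
>     stock pbs_mom ports unless they were changed in the config.)
>
>     # from shc188
>     nc -zv shc172 15002
>     nc -zv shc172 15003
>     # and from shc172 back to shc188
>     nc -zv shc188 15002
>     nc -zv shc188 15003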
>
>
>     Any suggestion on how to debug this would be much appreciated.
>
>
>     Thanks.
>
>     Steven.
>
>
>
>     _______________________________________________
>     torqueusers mailing list
>     torqueusers at supercluster.org
>     http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>
>
> -- 
> Ken Nielson
> +1 801.717.3700 office +1 801.717.3738 fax
> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
> www.adaptivecomputing.com
>
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
