[torqueusers] MPI job not able to run after upgrade to 4.1.5.1

Ken Nielson knielson at adaptivecomputing.com
Tue Oct 15 17:07:43 MDT 2013


I ran the example against 4.2-dev and it works.

The 4.1.x branch is going to be deprecated soon. Is there a reason not to
try 4.2.5.h3?

Ken


On Tue, Oct 15, 2013 at 4:04 PM, Steven Lo <slo at cacr.caltech.edu> wrote:

>
> Hi,
>
> We just upgraded our Torque server to 4.1.5.1 and we are having trouble
> running a simple MPI program with just 2 nodes:
>
> /* C Example of hello_world with 4 basic mpi calls */
> #include <stdio.h>
> #include <mpi.h>
>
> int main (int argc, char *argv[])
> {
>   int rank, size;
>
>   MPI_Init (&argc, &argv);    /* starts MPI */
>   MPI_Comm_rank (MPI_COMM_WORLD, &rank);    /* get current process id */
>   MPI_Comm_size (MPI_COMM_WORLD, &size);    /* get number of processes */
>   printf( "Hello world from process %d of %d\n", rank, size );
>   MPI_Finalize();
>   return 0;
> }
>
>
> From the 'strace' (output attached), it looks like the pbs_mom (on
> 172.18.1.188) is not able to
> communicate properly with the sister node (172.18.1.172):
>
>     shc172 - daemon did not report back when launched
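>
> As a first sanity check, it may be worth confirming that shc188 can even
> open a TCP connection to the MOM on shc172. Below is a minimal sketch of
> such a check (a hypothetical helper, not TORQUE code; it assumes the
> default pbs_mom service port 15002, so adjust if yours is configured
> differently):
>
> /* momcheck.c - hypothetical connectivity sketch, not part of TORQUE.
>  * Tries a plain TCP connect from the mother superior to the sister MOM.
>  * Port 15002 is assumed to be the default pbs_mom service port.
>  * Usage: ./momcheck shc172 15002
>  */
> #include <stdio.h>
> #include <string.h>
> #include <unistd.h>
> #include <netdb.h>
> #include <sys/types.h>
> #include <sys/socket.h>
>
> int main (int argc, char *argv[])
> {
>   const char *host = (argc > 1) ? argv[1] : "shc172";
>   const char *port = (argc > 2) ? argv[2] : "15002";
>   struct addrinfo hints, *res;
>   int sock, rc;
>
>   memset(&hints, 0, sizeof(hints));
>   hints.ai_family   = AF_INET;
>   hints.ai_socktype = SOCK_STREAM;
>
>   rc = getaddrinfo(host, port, &hints, &res);
>   if (rc != 0) {
>     fprintf(stderr, "getaddrinfo(%s): %s\n", host, gai_strerror(rc));
>     return 1;
>   }
>
>   sock = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
>   if (sock < 0 || connect(sock, res->ai_addr, res->ai_addrlen) != 0) {
>     perror("socket/connect");    /* refused, timed out, or no socket */
>     freeaddrinfo(res);
>     return 1;
>   }
>
>   printf("TCP connect to %s:%s succeeded\n", host, port);
>   close(sock);
>   freeaddrinfo(res);
>   return 0;
> }
>
> If that connects from shc188 but the job still fails the same way, basic
> connectivity is probably not the culprit and the mom_logs on shc172 would
> be the next place we look.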
>
>
>
> The 'qstat -n' shows the job is queued properly:
>
> shc188: qstat -n
>
> maitre:
>                                                                                Req'd  Req'd      Elap
> Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory Time     S Time
> -------------------- ----------- -------- ---------------- ------ ----- ------ ------ -------- - --------
> 444347.maitre        sharon      weekdayQ STDIN                 0     2      2     -- 00:30:00 R      --
>    shc188/0+shc172/0
>
>
>
> The section in the strace which puzzles us is the following:
>
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 12
> setsockopt(12, SOL_SOCKET, SO_LINGER, {onoff=1, linger=5}, 8) = 0
> connect(12, {sa_family=AF_INET, sin_port=htons(15003),
> sin_addr=inet_addr("127.0.0.1")}, 16) = 0
> mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
> = 0x2b3a50008000
> write(12, "+2+12+13444344.maitre2+32F8908E4"..., 66) = 66
> poll([{fd=12, events=POLLIN|POLLHUP}], 1, 2147483647) = 1 ([{fd=12,
> revents=POLLIN}])
> fcntl(12, F_GETFL)                      = 0x2 (flags O_RDWR)
> read(12, "", 262144)                    = 0
> close(12)                               = 0
> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN},
> {fd=7, events=POLLIN}, {fd=9, events=POLLIN}], 5, 1000) = 0 (Timeout)
> sched_yield()                           = 0
> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN},
> {fd=7, events=POLLIN}, {fd=9, events=POLLIN}], 5, 1000) = 0 (Timeout)
> sched_yield()                           = 0
>     .
>     .
>     .
>
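> For what it's worth, the detail that stands out to us in those lines is
> that the write(12, ...) = 66 is immediately followed by read(12, "",
> 262144) = 0: POLLIN together with a zero-byte read means end-of-file,
> i.e. whatever is listening on 127.0.0.1:15003 accepted the connection and
> then closed it without sending any reply. Here is a minimal sketch of
> that pattern (hypothetical code with a placeholder payload, not TORQUE's
> actual protocol), in case it helps when reading the trace:
>
> /* replycheck.c - hypothetical sketch, not TORQUE code.  Mirrors the
>  * connect/write/poll/read pattern from the trace above and reports
>  * whether the peer answered, closed without replying (read() == 0),
>  * or stayed silent.  The payload is a placeholder, not a real request.
>  */
> #include <stdio.h>
> #include <string.h>
> #include <unistd.h>
> #include <poll.h>
> #include <arpa/inet.h>
> #include <netinet/in.h>
> #include <sys/socket.h>
>
> int main (void)
> {
>   const char *payload = "hello\n";        /* placeholder bytes only */
>   struct sockaddr_in addr;
>   struct pollfd pfd;
>   char buf[256];
>   ssize_t n;
>   int fd;
>
>   memset(&addr, 0, sizeof(addr));
>   addr.sin_family = AF_INET;
>   addr.sin_port = htons(15003);                    /* same port as in the trace */
>   addr.sin_addr.s_addr = inet_addr("127.0.0.1");
>
>   fd = socket(AF_INET, SOCK_STREAM, 0);
>   if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
>     perror("connect");
>     return 1;
>   }
>
>   if (write(fd, payload, strlen(payload)) < 0) {
>     perror("write");
>     return 1;
>   }
>
>   pfd.fd = fd;
>   pfd.events = POLLIN;
>   if (poll(&pfd, 1, 5000) <= 0)                    /* nothing within 5 s */
>     printf("timeout: no reply, connection still open\n");
>   else if ((n = read(fd, buf, sizeof(buf))) == 0)
>     printf("EOF: peer closed the connection without replying\n");
>   else if (n > 0)
>     printf("got %zd bytes of reply\n", n);
>   else
>     perror("read");
>
>   close(fd);
>   return 0;
> }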
>
>
> I don't think the firewall is an issue since we are allowing all packets
> from/to nodes in the private network.
>
>
> Any suggestions on how to debug this are much appreciated.
>
>
> Thanks.
>
> Steven.
>
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>


-- 
Ken Nielson
+1 801.717.3700 office +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
www.adaptivecomputing.com

