[torqueusers] MPI job not able to run after upgraded to 4.1.5.1

Steven Lo slo at cacr.caltech.edu
Tue Oct 15 16:10:34 MDT 2013


Forgot to include the job commands:

shc-b: qsub -I -l nodes=2:core8:ppn=1
qsub: waiting for job 444347.maitre to start
qsub: job 444347.maitre ready

[ using intel_11.1 ]
[ using openmpi_1.4.3_intel_11.1 ]
[ using totalview_8.7 ]

shc188: mpirun -np 2 ./hello_world
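
As a sanity check inside the interactive job (assuming a standard Torque/Open MPI setup), one can confirm that both allocated hosts appear in the node file and that a plain Torque task launch works before involving MPI at all:

shc188: cat $PBS_NODEFILE
shc188: pbsdsh hostname

The node file should list shc188 and shc172; if pbsdsh itself hangs, the failure is in the Torque mom/TM layer rather than in Open MPI.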


On 10/15/2013 03:04 PM, Steven Lo wrote:
>
> Hi,
>
> We just upgraded our Torque server to 4.1.5.1 and we are having
> trouble running a simple MPI program with just 2 nodes:
>
> /* C Example of hello_world with 4 basic mpi calls */
> #include <stdio.h>
> #include <mpi.h>
>
>
> int main (int argc, char *argv[])
> {
>   int rank, size;
>
>   MPI_Init (&argc, &argv);    /* starts MPI */
>   MPI_Comm_rank (MPI_COMM_WORLD, &rank);    /* get current process id */
>   MPI_Comm_size (MPI_COMM_WORLD, &size);    /* get number of processes */
>   printf( "Hello world from process %d of %d\n", rank, size );
>   MPI_Finalize();
>   return 0;
> }
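>
> (The binary is presumably built with the Open MPI compiler wrapper, e.g.
> "mpicc hello_world.c -o hello_world", assuming the source file is named
> hello_world.c.)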
>
>
> From the 'strace' (output attached), it looks like the pbs_mom (on 
> 172.18.1.188) is not able to
> communicate properly with the sister node (172.18.1.172):
>
>     shc172 - daemon did not report back when launched
>
>
>
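> (For completeness, the sister mom's view can be queried with Torque's
> momctl, e.g. "momctl -d 3 -h shc172", and shc172's own mom log, typically
> under $PBS_HOME/mom_logs/ (e.g. /var/spool/torque/mom_logs/), should say
> why it did not report back.)
>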
> The 'qstat -n' output shows the job started properly, with both nodes assigned:
>
> shc188: qstat -n
>
> maitre:
>                                                                                Req'd     Req'd       Elap
> Job ID               Username    Queue    Jobname          SessID   NDS    TSK Memory     Time S     Time
> -------------------- ----------- -------- ---------------- ------ ----- ------ ------ -------- - --------
> 444347.maitre        sharon      weekdayQ STDIN                 0     2      2     -- 00:30:00 R       --
>    shc188/0+shc172/0
>
>
>
> The section of the strace which puzzles us is the following:
>
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 12
> setsockopt(12, SOL_SOCKET, SO_LINGER, {onoff=1, linger=5}, 8) = 0
> connect(12, {sa_family=AF_INET, sin_port=htons(15003), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
> mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b3a50008000
> write(12, "+2+12+13444344.maitre2+32F8908E4"..., 66) = 66
> poll([{fd=12, events=POLLIN|POLLHUP}], 1, 2147483647) = 1 ([{fd=12, revents=POLLIN}])
> fcntl(12, F_GETFL)                      = 0x2 (flags O_RDWR)
> read(12, "", 262144)                    = 0
> close(12)                               = 0
> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}], 5, 1000) = 0 (Timeout)
> sched_yield()                           = 0
> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}], 5, 1000) = 0 (Timeout)
> sched_yield()                           = 0
> sched_yield()                           = 0
>     .
>     .
>     .
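>
> (The read(12, "", 262144) = 0 right after the write indicates that the
> peer on 127.0.0.1:15003 closed the connection without sending any reply,
> which at least seems consistent with the "daemon did not report back when
> launched" message above.)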
>
>
>
> I don't think the firewall is an issue since we are allowing all 
> packets from/to nodes in the private network.
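>
> (If connectivity ever needs to be ruled out explicitly, the ports involved,
> assuming default Torque settings, are 15001 on pbs_server and 15002/15003
> on each pbs_mom; a quick check from shc188 would be something along the
> lines of "nc -zv shc172 15002" and "nc -zv shc172 15003".)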
>
>
> Any suggestions on how to debug this would be much appreciated.
>
>
> Thanks.
>
> Steven.
>
>
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
