[torqueusers] osc mpiexec and torque4

Doug Johnson djohnson at osc.edu
Wed Jul 25 10:48:54 MDT 2012


At Wed, 25 Jul 2012 09:06:14 -0700,
Michael Jennings wrote:
> 
> On Wednesday, 25 July 2012, at 11:29:34 (-0400),
> Brock Palen wrote:
> 
> > The OSC mpiexec appears to have issues with torque 4.1.0  but works fine with 2.x
> > 
> > Has anyone gotten mpiexec (the popular tm aware launcher for mpich2 and mvapich) to work with torque 4?
> > 
> > I have some debugging information below:
> > 
> > [brockp at nyx7000 ~]$ /home/software/rhel6/mpiexec/bin/mpiexec -v -v -v ~/a.out
> > mpiexec: stat_exe: testing "/home/brockp/a.out".
> > mpiexec: resolve_exe: using absolute path "/home/brockp/a.out".
> > mpiexec: stdio_notice_streams: aggregate = 0 1 2.
> > mpiexec: concurrent_init: unix socket exists, trying to connect.
> > mpiexec: concurrent_init: old master died, reusing his fifo as master.
> > mpiexec: concurrent_init: i am concurrent master.
> > Segmentation fault
> > 
> > 
> > (gdb) where
> > #0  0x00000036afd31aff in __strlen_sse42 () from /lib64/libc.so.6
> > #1  0x00002aaaaaac53af in pbs_connect (server_name_ptr=0x0) at ../Libifl/pbsD_connect.c:1256
> > #2  0x0000000000405170 in get_hosts () at get_hosts.c:98
> > #3  0x0000000000403601 in main (argc=1, argv=0x7fffffffd890) at mpiexec.c:700
> > 
> > 
> > Line 1256 of pbsD_connect.c  is:
> >  strncat(server_name_list, pbs_get_server_list(),
> >      sizeof(server_name_list) -1 - strlen(server_name_ptr) - 1);
> 
> You can try changing the strlen() call to:
> 
> ((server_name_ptr) ? (strlen(server_name_ptr)) : (0))
> 
> but that won't fix the ultimate problem of an invalid server name
> being passed in by mpiexec.  (It will, however, make libtorque more
> robust and should probably be done upstream.)
> 
> In fact, I'd change that whole section of code:
> 
>   /* Use the list from the server_name file.
>    * If a server name is passed in, append it at the beginning. */
> 
>   if (server_name_ptr && server_name_ptr[0])
>     {
>     snprintf(server_name_list, sizeof(server_name_list), "%s,%s",
>              server_name_ptr, pbs_get_server_list());
>     }
>   else
>     {
>     strncat(server_name_list, pbs_get_server_list(),
>             sizeof(server_name_list) - 1);
>     }
> 
>   if (getenv("PBSDEBUG"))
>     fprintf(stderr, "pbs_connect using following server list \"%s\"\n",
>         server_name_list);
> 
> > Examining server_name_list and server_name_ptr I get interesting results:
> > 
> > (gdb) x server_name_list
> > 0x7fffffffc5f0:	0x00000000
> > (gdb) printf "%s", server_name_list
> > (nothing returned by gdb)
> > 
> > (gdb) x server_name_ptr
> > 0x0:	Cannot access memory at address 0x0
> > 
> > The empty string of server_name_list and the cannot access memory
> > appear strange to me, but I am not sure.
> 
> There's at least one bug in terms of lack of robustness in not
> handling the case where server_name_ptr is NULL.  However, there's
> another problem somewhere regarding why it's NULL to begin with.  Have
> you looked at the get_hosts() code (in mpiexec) at all?  That's
> probably where I'd start.  Once you/we figure out why that variable is
> being passed to pbs_connect() as NULL, we should have a better idea
> what's going awry.
> 

Has pbs_connect changed in torque 4?  From the man page,

       If the parameter, server, is either the null string or a null
       pointer, a connection will be opened to the default server.  The
       default server is defined by (a) the setting of the environment
       variable PBS_DEFAULT which contains a destination, or (b) the
       destination in the batch administrator established file
       {PBS_DIR}/default_destn.
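
To make the documented behavior concrete, here is a minimal sketch of the
fallback order the man page describes.  It is not the libtorque
implementation: choose_server() and its file_default parameter are
hypothetical names, and file_default merely stands in for whatever was read
from the administrator-established default file.

```c
#include <stdlib.h>

/* Hypothetical helper, NOT part of libtorque.
 * Returns the server name a connection should use.  A NULL pointer and an
 * empty string both mean "use the default server", as in the man page. */
const char *choose_server(const char *server, const char *file_default)
{
    const char *env;

    /* The caller named a server explicitly. */
    if (server != NULL && server[0] != '\0')
        return server;

    /* (a) The PBS_DEFAULT environment variable. */
    env = getenv("PBS_DEFAULT");
    if (env != NULL && env[0] != '\0')
        return env;

    /* (b) The administrator-established default (assumed to be read
     * elsewhere and passed in here; may itself be NULL). */
    return file_default;
}
```

Under this reading, choose_server(NULL, ...) and choose_server("", ...)
give the same answer, so a caller passing 0 should never reach a strlen()
on a null pointer.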

Either something is wrong in Brock's environment, or pbs_connect does
not behave the same way in torque 4.  I agree that more error checking
in pbs_connect on the client side is probably needed.
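
As a rough illustration of the kind of client-side checking meant here, the
sketch below builds the server list without ever dereferencing a null
server name.  It is a standalone stand-in, not the code in pbsD_connect.c:
build_server_list() is a hypothetical name, and its defaults argument plays
the role of the value returned by pbs_get_server_list().

```c
#include <stdio.h>

/* Hypothetical stand-in for the list-building step; NOT the libtorque source.
 * Writes "<name>,<defaults>" when a server name is given, otherwise just
 * "<defaults>".  The result is always NUL-terminated and never overruns out. */
void build_server_list(char *out, size_t outlen,
                       const char *server_name_ptr, const char *defaults)
{
    if (out == NULL || outlen == 0)
        return;

    /* Treat a missing default list as empty instead of crashing. */
    if (defaults == NULL)
        defaults = "";

    if (server_name_ptr != NULL && server_name_ptr[0] != '\0')
        snprintf(out, outlen, "%s,%s", server_name_ptr, defaults);
    else
        snprintf(out, outlen, "%s", defaults);
}
```

Using snprintf() in both branches avoids computing the remaining space by
hand, which is where the strlen() on a null pointer came from in the
backtrace.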

Here are the relevant lines from mpiexec,

    /*
     * Now go talk to PBS.  Get the hostnames in the job and compress it
     * down to our idea of nodes, matching up against the tasklist as we go.
     */
    fd = pbs_connect(0);
    if (fd < 0)
        error_pbs("%s: pbs_connect", __func__);

The pbs_connect succeeds.  I'm not sure what more error checking could
be done in mpiexec.

Doug


