[torqueusers] osc mpiexec and torque4

Michael Jennings mej at lbl.gov
Wed Jul 25 10:06:14 MDT 2012


On Wednesday, 25 July 2012, at 11:29:34 (-0400),
Brock Palen wrote:

> The OSC mpiexec appears to have issues with torque 4.1.0  but works fine with 2.x
> 
> Has anyone gotten mpiexec (the popular tm aware launcher for mpich2 and mvapich) to work with torque 4?
> 
> I have some debugging information below:
> 
> [brockp at nyx7000 ~]$ /home/software/rhel6/mpiexec/bin/mpiexec -v -v -v ~/a.out
> mpiexec: stat_exe: testing "/home/brockp/a.out".
> mpiexec: resolve_exe: using absolute path "/home/brockp/a.out".
> mpiexec: stdio_notice_streams: aggregate = 0 1 2.
> mpiexec: concurrent_init: unix socket exists, trying to connect.
> mpiexec: concurrent_init: old master died, reusing his fifo as master.
> mpiexec: concurrent_init: i am concurrent master.
> Segmentation fault
> 
> 
> (gdb) where
> #0  0x00000036afd31aff in __strlen_sse42 () from /lib64/libc.so.6
> #1  0x00002aaaaaac53af in pbs_connect (server_name_ptr=0x0) at ../Libifl/pbsD_connect.c:1256
> #2  0x0000000000405170 in get_hosts () at get_hosts.c:98
> #3  0x0000000000403601 in main (argc=1, argv=0x7fffffffd890) at mpiexec.c:700
> 
> 
> Line 1256 of pbsD_connect.c  is:
>  strncat(server_name_list, pbs_get_server_list(),
>      sizeof(server_name_list) -1 - strlen(server_name_ptr) - 1);

You can try changing the strlen() call to:

((server_name_ptr) ? (strlen(server_name_ptr)) : (0))

but that won't fix the ultimate problem of an invalid server name
being passed in by mpiexec.  (It will, however, make libtorque more
robust and should probably be done upstream.)
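
For illustration, the guard can be pulled out into a small helper.  This
is only a sketch: safe_strlen() and room_after_prefix() are hypothetical
names, not part of libtorque, and the size arithmetic simply mirrors the
"sizeof(...) - 1 - strlen(...) - 1" expression quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not in libtorque): strlen() that treats a NULL
 * pointer as an empty string instead of dereferencing it. */
size_t safe_strlen(const char *s)
{
    return s ? strlen(s) : 0;
}

/* Room left for strncat() in a buffer of bufsz bytes after reserving
 * space for the prefix plus the two bytes the quoted expression
 * subtracts.  Clamps to 0 instead of letting the unsigned subtraction
 * wrap around. */
size_t room_after_prefix(size_t bufsz, const char *prefix)
{
    size_t used = safe_strlen(prefix) + 2;

    return (used < bufsz) ? bufsz - used : 0;
}
```

With a NULL prefix this yields bufsz - 2 rather than a crash inside
strlen(); the clamp is an extra precaution that the quoted line does
not have.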

In fact, I'd change that whole section of code:

  /* Use the list from the server_name file.
   * If a server name is passed in, append it at the beginning. */

  if (server_name_ptr && server_name_ptr[0])
    {
    snprintf(server_name_list, sizeof(server_name_list), "%s,%s",
             server_name_ptr, pbs_get_server_list());
    }
  else
    {
    strncat(server_name_list, pbs_get_server_list(),
            sizeof(server_name_list) - 1);
    }

  if (getenv("PBSDEBUG"))
    fprintf(stderr, "pbs_connect using following server list \"%s\"\n",
        server_name_list);
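
To try that logic outside of libtorque, here is a standalone sketch.
build_server_list() is a hypothetical function, and its file_list
argument stands in for whatever pbs_get_server_list() returns; I have
also used snprintf() in both branches (instead of strncat() in the
else branch) so the result does not depend on the buffer already being
empty.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the list-building step in pbs_connect().
 * file_list plays the role of pbs_get_server_list()'s return value.
 * Returns 0 on success, -1 if the result had to be truncated. */
int build_server_list(char *out, size_t outsz,
                      const char *server_name_ptr, const char *file_list)
{
    int n;

    assert(out != NULL && outsz > 0);

    if (file_list == NULL)
        file_list = "";

    if (server_name_ptr && server_name_ptr[0])
    {
        /* A server name was passed in: put it at the front of the list. */
        n = snprintf(out, outsz, "%s,%s", server_name_ptr, file_list);
    }
    else
    {
        /* NULL or empty name: fall back to the server_name file list. */
        n = snprintf(out, outsz, "%s", file_list);
    }

    /* snprintf() always NUL-terminates when outsz > 0; a return value
     * >= outsz means the output was cut short. */
    return (n >= 0 && (size_t)n < outsz) ? 0 : -1;
}
```

Called with a NULL or empty server_name_ptr, this just copies the file
list; called with a name, it yields "name,list".  Truncation is
reported to the caller rather than silently ignored.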

> Examining server_name_list and server_name_ptr I get interesting results:
> 
> (gdb) x server_name_list
> 0x7fffffffc5f0:	0x00000000
> (gdb) printf "%s", server_name_list
> (nothing returned by gdb)
> 
> (gdb) x server_name_ptr
> 0x0:	Cannot access memory at address 0x0
> 
> The empty string of server_name_list and the cannot access memory
> appear strange to me, but I am not sure.

There's at least one robustness bug here: pbs_connect() doesn't handle
the case where server_name_ptr is NULL.  However, there's a second
problem somewhere, namely why it's NULL to begin with.  Have you
looked at the get_hosts() code (in mpiexec) at all?  That's probably
where I'd start.  Once you/we figure out why that variable is being
passed to pbs_connect() as NULL, we should have a better idea what's
going awry.

HTH,
Michael

-- 
Michael Jennings <mej at lbl.gov>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

