[torqueusers] osc mpiexec and torque4
djohnson at osc.edu
Wed Jul 25 10:48:54 MDT 2012
At Wed, 25 Jul 2012 09:06:14 -0700,
Michael Jennings wrote:
> On Wednesday, 25 July 2012, at 11:29:34 (-0400),
> Brock Palen wrote:
> > The OSC mpiexec appears to have issues with torque 4.1.0 but works fine with 2.x.
> > Has anyone gotten mpiexec (the popular TM-aware launcher for mpich2 and mvapich) to work with torque 4?
> > I have some debugging information below:
> > [brockp@nyx7000 ~]$ /home/software/rhel6/mpiexec/bin/mpiexec -v -v -v ~/a.out
> > mpiexec: stat_exe: testing "/home/brockp/a.out".
> > mpiexec: resolve_exe: using absolute path "/home/brockp/a.out".
> > mpiexec: stdio_notice_streams: aggregate = 0 1 2.
> > mpiexec: concurrent_init: unix socket exists, trying to connect.
> > mpiexec: concurrent_init: old master died, reusing his fifo as master.
> > mpiexec: concurrent_init: i am concurrent master.
> > Segmentation fault
> > (gdb) where
> > #0 0x00000036afd31aff in __strlen_sse42 () from /lib64/libc.so.6
> > #1 0x00002aaaaaac53af in pbs_connect (server_name_ptr=0x0) at ../Libifl/pbsD_connect.c:1256
> > #2 0x0000000000405170 in get_hosts () at get_hosts.c:98
> > #3 0x0000000000403601 in main (argc=1, argv=0x7fffffffd890) at mpiexec.c:700
> > Line 1256 of pbsD_connect.c is:
> > strncat(server_name_list, pbs_get_server_list(),
> > sizeof(server_name_list) -1 - strlen(server_name_ptr) - 1);
> You can try changing the strlen() call to:
> ((server_name_ptr) ? (strlen(server_name_ptr)) : (0))
> but that won't fix the ultimate problem of an invalid server name
> being passed in by mpiexec. (It will, however, make libtorque more
> robust and should probably be done upstream.)
> In fact, I'd change that whole section of code:
>   /* Use the list from the server_name file.
>    * If a server name is passed in, append it at the beginning. */
>   if (server_name_ptr && *server_name_ptr)
>     snprintf(server_name_list, sizeof(server_name_list), "%s,%s",
>              server_name_ptr, pbs_get_server_list());
>   else
>     strncat(server_name_list, pbs_get_server_list(),
>             sizeof(server_name_list) - 1);
>   if (getenv("PBSDEBUG"))
>     fprintf(stderr, "pbs_connect using following server list \"%s\"\n",
>             server_name_list);
> > Examining server_name_list and server_name_ptr I get interesting results:
> > (gdb) x server_name_list
> > 0x7fffffffc5f0: 0x00000000
> > (gdb) printf "%s", server_name_list
> > (nothing returned by gdb)
> > (gdb) x server_name_ptr
> > 0x0: Cannot access memory at address 0x0
> > The empty server_name_list and the "Cannot access memory" error for
> > server_name_ptr look strange to me, but I am not sure.
> There's at least one bug in terms of lack of robustness in not
> handling the case where server_name_ptr is NULL. However, there's
> another problem somewhere regarding why it's NULL to begin with. Have
> you looked at the get_hosts() code (in mpiexec) at all? That's
> probably where I'd start. Once you/we figure out why that variable is
> being passed to pbs_connect() as NULL, we should have a better idea
> what's going awry.
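
For illustration, the NULL-safe list construction suggested above can be sketched in plain C. This is only a sketch, not libtorque code: build_server_list and its parameters are hypothetical names, and default_list stands in for whatever pbs_get_server_list() would return.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, not part of libtorque.
 * Builds a comma-separated server list in dst. A NULL or empty
 * server_name is simply skipped instead of being passed to strlen().
 * Returns the length written, or -1 on truncation or format error. */
int build_server_list(char *dst, size_t dst_size,
                      const char *server_name,
                      const char *default_list)
{
    int have_name = server_name != NULL && *server_name != '\0';
    int have_list = default_list != NULL && *default_list != '\0';
    int n;

    assert(dst != NULL && dst_size > 0);

    if (have_name && have_list)
        n = snprintf(dst, dst_size, "%s,%s", server_name, default_list);
    else if (have_name)
        n = snprintf(dst, dst_size, "%s", server_name);
    else
        n = snprintf(dst, dst_size, "%s", have_list ? default_list : "");

    /* snprintf reports the length it wanted; compare to detect truncation. */
    if (n < 0 || (size_t)n >= dst_size)
        return -1;
    return n;
}
```

Using snprintf for every branch keeps the buffer terminated and lets the caller see truncation, which the strncat arithmetic in the original line cannot report.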
Has pbs_connect changed in torque 4? From the man page,
If the parameter, server, is either the null string or a null
pointer, a connection will be opened to the default server. The
default server is defined by (a) the setting of the environment
variable PBS_DEFAULT which contains a destination, or (b) the destination
in the batch administrator established file
Either something is wrong in Brock's environment, or pbs_connect does
not work the same in torque 4.0. I agree that more error checking in
pbs_connect on the client side is probably needed.
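
As a rough sketch of the behaviour the man page describes, the selection of a server could look like the following. The function names are hypothetical and not taken from the torque sources. The precedence of PBS_DEFAULT over the administrator's file is an assumption, and file_default stands in for the contents of that file, which the caller would have to read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Return 1 if s is a usable, non-empty string. */
int is_set(const char *s)
{
    return s != NULL && *s != '\0';
}

/* Hypothetical helper: pure selection logic.
 * An explicit server wins; otherwise fall back to the environment
 * value, then to the administrator's default (assumed order). */
const char *pick_server(const char *server,
                        const char *env_default,
                        const char *file_default)
{
    if (is_set(server))
        return server;
    if (is_set(env_default))
        return env_default;
    if (is_set(file_default))
        return file_default;
    return NULL;  /* nothing configured; caller must report an error */
}

/* Hypothetical wrapper that consults the PBS_DEFAULT environment variable. */
const char *resolve_server(const char *server, const char *file_default)
{
    const char *chosen = pick_server(server, getenv("PBS_DEFAULT"),
                                     file_default);

    /* Postcondition: the caller never gets an empty string back. */
    assert(chosen == NULL || *chosen != '\0');
    return chosen;
}
```

With something like this, a NULL or empty argument never reaches a string function directly; the caller sees either a usable name or NULL.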
Here are the relevant lines from mpiexec,
    /*
     * Now go talk to PBS. Get the hostnames in the job and compress it
     * down to our idea of nodes, matching up against the tasklist as we go.
     */
    fd = pbs_connect(0);
    if (fd < 0)
        error_pbs("%s: pbs_connect", __func__);
The pbs_connect succeeds. I'm not sure what more error checking could
be done in mpiexec.