[torqueusers] TM interface - MOM daemon on the node dies when tm_init is called
Prakash Velayutham
velayups at email.uc.edu
Sat Apr 8 13:13:50 MDT 2006
Hi,
I am starting a new thread on the same TM topic because I have a different aspect to discuss.
I am still testing with the same MPI program given here:
#include <stdio.h>
#include <stdlib.h>   /* exit */
#include <string.h>   /* strlen */
#include <unistd.h>   /* gethostname */
#include <tm.h>
#include <mpi.h>

extern char **environ;

/* Abort with a message if a TM call did not return TM_SUCCESS. */
void do_check(int val, char *msg) {
    if (TM_SUCCESS != val) {
        printf("ret is %d instead of %d: %s\n", val, TM_SUCCESS, msg);
        exit(1);
    }
}

int main (int argc, char *argv[]) {
    int size, rank, ret, err, numnodes, local_err;
    MPI_Status status;
    /* A real array: the original wrote through an uninitialized char **. */
    char *input[2] = { "/bin/echo", "Hello There" };
    struct tm_roots task_root;
    tm_node_id *nodelist;
    tm_event_t event;
    tm_task_id task_id;
    char hostname[64];
    char buf[]="11000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000";

    gethostname(hostname, 64);
    ret = MPI_Init (&argc, &argv);
    if (ret) {
        printf ("Error: %d\n", ret);
        return (1);
    }
    ret = MPI_Comm_size (MPI_COMM_WORLD, &size);
    if (ret) {
        printf("Error: %d\n", ret);
        return (1);
    }
    ret = MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    if (ret) {
        printf("Error: %d\n", ret);
        return (1);
    }
    printf ("First Hostname: %s node %d out of %d\n", hostname, rank, size);

    /* Pair up ranks: even ranks send to rank+1; an odd rank out sits idle. */
    if (size%2 && rank==size-1)
        printf("Sitting out\n");
    else {
        if (rank%2==0)
            MPI_Send(buf, strlen(buf), MPI_BYTE, rank+1, 11, MPI_COMM_WORLD);
        else
            MPI_Recv(buf, sizeof(buf), MPI_BYTE, rank-1, 11, MPI_COMM_WORLD, &status);
    }
    printf ("Second Hostname: %s node %d out of %d\n", hostname, rank, size);

    /* Only rank 1 talks to the local MOM through the TM interface. */
    if (rank == 1) {
        ret = tm_init(NULL, &task_root);
        do_check(ret, "tm_init failed");
        printf ("Special Hostname: %s node %d out of %d\n", hostname, rank, size);
        task_id = 0xdeadbeef;
        event = 0xdeadbeef;
        printf("%s\t%s", input[0], input[1]);
        tm_finalize();
    }
    MPI_Finalize ();
    return (0);
}
The error I am getting is:
First Hostname: wins05 node 0 out of 4
First Hostname: wins03 node 1 out of 4
First Hostname: wins02 node 2 out of 4
First Hostname: wins01 node 3 out of 4
Second Hostname: wins05 node 0 out of 4
Second Hostname: wins02 node 2 out of 4
Second Hostname: wins03 node 1 out of 4
Second Hostname: wins01 node 3 out of 4
tm_poll: protocol number dis error 11
ret is 17002 instead of 0: tm_init failed
3 processes killed (possibly by Open MPI)
I am using Torque-2.0.0p8 and Open MPI-1.0.1. Please note that I am calling tm_init from the node that Open MPI assigns rank 1 (the MS node generally gets rank 0). What I am seeing is that the MOM daemon on the rank 1 node dies when the code reaches the tm_init() call, which I did not expect at all. What I can't figure out is which causes which: did my tm_init call kill the daemon, or did an already dead daemon cause the 17002 error? And why would the MOM daemon die at all because of a tm_init call?
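To make the 17002 easier to read, here is a small sketch that decodes a TM return value into a symbolic name. The numeric values are my reading of the TM error defines in TORQUE's tm_.h and should be checked against the installed header. The sk_ names are made up for this example and are not part of the TM API. If the table is right, 17002 is TM_ENOTCONNECTED, which would fit a MOM that is no longer listening, but it does not say which happened first.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed values, copied from my reading of tm_.h; verify locally. */
enum {
    SK_TM_SUCCESS         = 0,
    SK_TM_ESYSTEM         = 17000,
    SK_TM_ENOEVENT        = 17001,
    SK_TM_ENOTCONNECTED   = 17002,
    SK_TM_EUNKNOWNCMD     = 17003,
    SK_TM_ENOTIMPLEMENTED = 17004,
    SK_TM_EBADENVIRONMENT = 17005,
    SK_TM_ENOTFOUND       = 17006,
    SK_TM_BADINIT         = 17007
};

/* Hypothetical helper: map a TM return value to its symbolic name. */
const char *sk_tm_err_name(int ret) {
    switch (ret) {
    case SK_TM_SUCCESS:         return "TM_SUCCESS";
    case SK_TM_ESYSTEM:         return "TM_ESYSTEM";
    case SK_TM_ENOEVENT:        return "TM_ENOEVENT";
    case SK_TM_ENOTCONNECTED:   return "TM_ENOTCONNECTED";
    case SK_TM_EUNKNOWNCMD:     return "TM_EUNKNOWNCMD";
    case SK_TM_ENOTIMPLEMENTED: return "TM_ENOTIMPLEMENTED";
    case SK_TM_EBADENVIRONMENT: return "TM_EBADENVIRONMENT";
    case SK_TM_ENOTFOUND:       return "TM_ENOTFOUND";
    case SK_TM_BADINIT:         return "TM_BADINIT";
    default:                    return "unknown TM code";
    }
}

/* Hypothetical helper: report a failed TM call on stderr.
 * Returns 0 on success and 1 on any error, so the caller can
 * decide whether to call MPI_Abort or MPI_Finalize before exiting. */
int sk_tm_report(const char *what, int ret) {
    assert(what != NULL);
    if (ret == SK_TM_SUCCESS)
        return 0;
    fprintf(stderr, "%s: TM error %d (%s)\n", what, ret, sk_tm_err_name(ret));
    return 1;
}
```

In the program above this could replace do_check, for example `if (sk_tm_report("tm_init", ret)) MPI_Abort(MPI_COMM_WORLD, 1);`, so that the failure is named in the output instead of printed as a bare number.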
Thanks,
Prakash