[torquedev] Problem with TM interface in Torque 2.1.0p0

Jeff Squyres (jsquyres) jsquyres at cisco.com
Thu May 18 15:37:05 MDT 2006


Greetings (apologies if this comes across twice; I initially sent this
from an address that was not subscribed and it disappeared into the
ether -- I don't know if it was discarded or sent to moderation).

Responding to a user post on the Open MPI mailing list, we may have
found a bug in the TM interface in Torque 2.1.0p0.  According to the tm
man page:

-----
tm_nodeinfo()  places a pointer to a malloc'ed array of tm_node_id's in
the pointer pointed at by list.  The order of the tm_node_id's in  list
is the same as that specified to MOM in the "exec_host" attribute.  The
int pointed to by nnodes contains the number of nodes allocated to  the
job.   This  is  information that is returned during initialization and
does not require communication with  MOM.   If  tm_init  has  not  been
called, TM_ESYSTEM is returned, otherwise TM_SUCCESS is returned.
-----

However, it seems that tm_nodeinfo() always returns TM_ESYSTEM,
regardless of whether tm_init() was invoked or not.  This was not the
behavior in prior versions of Torque.
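
To make the documented contract concrete, here is a minimal sketch that
uses stand-in stubs rather than Torque itself.  Everything prefixed with
mock_ or MOCK_ is hypothetical and is not part of the TM API; the value
17000 is taken from the test run in this message, while 0 for success,
the int node ID type, and the sample node IDs are assumptions.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in definitions so the sketch builds without Torque.
 * 17000 matches the TM_ESYSTEM value printed by the test run in this
 * message; 0 for success and an int node ID are assumptions. */
enum { MOCK_TM_SUCCESS = 0, MOCK_TM_ESYSTEM = 17000 };
typedef int mock_node_id;

static int mock_initialized = 0;
/* Hypothetical node IDs, in "exec_host" order. */
static const mock_node_id mock_exec_host[] = { 10, 11, 12 };
#define MOCK_NNODES \
    ((int)(sizeof(mock_exec_host) / sizeof(mock_exec_host[0])))

/* Stub for tm_init(): only records that initialization happened. */
int mock_tm_init(void)
{
    mock_initialized = 1;
    return MOCK_TM_SUCCESS;
}

/* Stub following the man page text: error before init, otherwise a
 * malloc'ed copy of the node list in exec_host order. */
int mock_tm_nodeinfo(mock_node_id **list, int *nnodes)
{
    mock_node_id *copy;
    int i;

    assert(list != NULL && nnodes != NULL);
    if (!mock_initialized)
        return MOCK_TM_ESYSTEM;

    copy = malloc(MOCK_NNODES * sizeof(*copy));
    if (copy == NULL)
        return MOCK_TM_ESYSTEM;  /* assumption: report as system error */
    for (i = 0; i < MOCK_NNODES; i++)
        copy[i] = mock_exec_host[i];
    *list = copy;
    *nnodes = MOCK_NNODES;
    return MOCK_TM_SUCCESS;
}

/* Returns 1 if the stub behaves as the man page describes, else 0. */
int contract_holds(void)
{
    mock_node_id *ids = NULL;
    int n = 0, i, ok;

    mock_initialized = 0;
    if (mock_tm_nodeinfo(&ids, &n) != MOCK_TM_ESYSTEM)
        return 0;                /* must fail before init */
    if (mock_tm_init() != MOCK_TM_SUCCESS)
        return 0;
    if (mock_tm_nodeinfo(&ids, &n) != MOCK_TM_SUCCESS)
        return 0;                /* must succeed after init */
    ok = (n == MOCK_NNODES);
    for (i = 0; ok && i < n; i++)
        ok = (ids[i] == mock_exec_host[i]);
    free(ids);                   /* the array is malloc'ed for the caller */
    return ok;
}
```

For the stub, contract_holds() returns 1.  The behavior reported here
would correspond to the second mock_tm_nodeinfo() call, the one made
after initialization, still returning MOCK_TM_ESYSTEM.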

Here is a sample program that shows the problem:

-----
#include <stdio.h>
#include <stdlib.h>   /* exit() */
#include <tm.h>

int main(int argc, char* argv[])
{
  int ret;
  struct tm_roots tm_root;
  tm_node_id *tm_node_ids = NULL;
  int num_node_ids = 0;

  ret = tm_init(NULL, &tm_root);
  if (TM_SUCCESS != ret) {
    printf("tm_init failed\n");
    exit(0);
  }

  ret = tm_nodeinfo(&tm_node_ids, &num_node_ids);
  if (TM_SUCCESS != ret) {
    printf("tm_nodeinfo barfed: ret %d, ESYSTEM %d\n",
           ret, TM_ESYSTEM);
    tm_finalize();
    exit(0);
  }

  tm_finalize();
  exit(0);
}
-----

Here's a compile and run that shows the problem:

-----
aon021:~/tmp jsquyres$ echo $PBS_ENVIRONMENT $PBS_JOBID
PBS_INTERACTIVE 401.aon.[[[private]]]
aon021:~/tmp jsquyres$ pbs-config --version
2.1.0p0
aon021:~/tmp jsquyres$ gcc test_tm.c -o test_tm -I /home/software/torque-2.1.0p0/include -L/home/software/torque-2.1.0p0/lib -ltorque
aon021:~/tmp jsquyres$ ./test_tm
tm_nodeinfo barfed: ret 17000, ESYSTEM 17000
aon021:~/tmp jsquyres$
-----

Is there a possibility that this is a local configuration problem?  If
it is a genuine bug in Torque, it hoses everyone who is using LAM/MPI
and/or Open MPI (and possibly mpiexec; I don't know whether mpiexec
uses tm_nodeinfo()).

Thanks.

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems

