[torquedev] Problem with TM interface in Torque 2.1.0p0

garrick at speculation.org
Thu May 18 16:38:11 MDT 2006


On Thu, May 18, 2006 at 05:37:05PM -0400, Jeff Squyres (jsquyres) alleged:
> Greetings (apologies if this comes across twice; I initially sent this
> from an address that was not subscribed and it disappeared into the
> ether -- I don't know if it was discarded or sent to moderation).
> 
> Responding to a user post on the Open MPI mailing list, we may have
> found a bug in the TM interface in Torque 2.1.0p0.  According to the tm
> man page:
> 
> -----
> tm_nodeinfo()  places a pointer to a malloc'ed array of tm_node_id's in
> the pointer pointed at by list.  The order of the tm_node_id's in  list
> is the same as that specified to MOM in the "exec_host" attribute.  The
> int pointed to by nnodes contains the number of nodes allocated to  the
> job.   This  is  information that is returned during initialization and
> does not require communication with  MOM.   If  tm_init  has  not  been
> called, TM_ESYSTEM is returned, otherwise TM_SUCCESS is returned.
> -----
> 
> However, it seems that tm_nodeinfo() always returns TM_ESYSTEM,
> regardless of whether tm_init() was invoked or not.  This was not the
> behavior in prior versions of Torque.
> 
> Here is a sample program that shows the problem:
> 
> -----
> #include <stdio.h>
> #include <stdlib.h>
> #include <tm.h>
> 
> int main(int argc, char* argv[])
> {
>   int ret;
>   struct tm_roots tm_root;
>   tm_node_id *tm_node_ids = NULL;
>   int num_node_ids = 0;
> 
>   ret = tm_init(NULL, &tm_root);
>   if (TM_SUCCESS != ret) {
>     printf("tm_init failed\n");
>     exit(1);
>   }
> 
>   ret = tm_nodeinfo(&tm_node_ids, &num_node_ids);
>   if (TM_SUCCESS != ret) {
>     printf("tm_nodeinfo barfed: ret %d, ESYSTEM %d\n",
>            ret, TM_ESYSTEM);
>     tm_finalize();
>     exit(1);
>   }
> 
>   tm_finalize();
>   exit(0);
> }
> -----
> 
> Here's a compile and run that shows the problem:
> 
> -----
> aon021:~/tmp jsquyres$ echo $PBS_ENVIRONMENT $PBS_JOBID
> PBS_INTERACTIVE 401.aon.[[[private]]]
> aon021:~/tmp jsquyres$ pbs-config --version
> 2.1.0p0
> aon021:~/tmp jsquyres$ gcc test_tm.c -o test_tm -I
> /home/software/torque-2.1.0p0/include
> -L/home/software/torque-2.1.0p0/lib -ltorque
> aon021:~/tmp jsquyres$ ./test_tm
> tm_nodeinfo barfed: ret 17000, ESYSTEM 17000
> aon021:~/tmp jsquyres$
> -----
> 
> Is there a possibility that this is a local configuration problem?  If
> it is a genuine bug in Torque, it hoses everyone who is using LAM/MPI
> and/or Open MPI (possibly mpiexec too?  I don't know whether mpiexec
> uses tm_nodeinfo().)

Both mpiexec and pbsdsh use tm_nodeinfo().  If you can't do 'pbsdsh
hostname' you know there is a problem.

I tried your test program and it works fine for me.  Nothing has changed
in that code in a long time.  I suspect the TM code is fine, but that
MOM thinks the job has 0 vnodes (which AFAIK should be impossible).

Can you send the output of 'qstat -f $jobid' and 'momctl -d 4 -h $node'?
Both of those should be run as an admin user (root) from the pbs_server
to ensure all info is reported.
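
Concretely, something like this on the pbs_server host ($jobid and $node
are placeholders for the problem job and one of its allocated nodes):

-----
# qstat -f $jobid        # full job attributes, including exec_host
# momctl -d 4 -h $node   # verbose MOM diagnostics for that node
-----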
