[torquedev] Problem with TM interface in Torque 2.1.0p0
garrick at speculation.org
Thu May 18 16:38:11 MDT 2006
On Thu, May 18, 2006 at 05:37:05PM -0400, Jeff Squyres (jsquyres) alleged:
> Greetings (apologies if this comes across twice; I initially sent this
> from an address that was not subscribed and it disappeared into the
> ether -- I don't know if it was discarded or sent to moderation).
>
> Responding to a user post on the Open MPI mailing list, we may have
> found a bug in the TM interface in Torque 2.1.0p0. According to the tm
> man page:
>
> -----
> tm_nodeinfo() places a pointer to a malloc'ed array of tm_node_id's in
> the pointer pointed at by list. The order of the tm_node_id's in list
> is the same as that specified to MOM in the "exec_host" attribute. The
> int pointed to by nnodes contains the number of nodes allocated to the
> job. This is information that is returned during initialization and
> does not require communication with MOM. If tm_init has not been
> called, TM_ESYSTEM is returned, otherwise TM_SUCCESS is returned.
> -----
>
> However, it seems that tm_nodeinfo() always returns TM_ESYSTEM,
> regardless of whether tm_init() was invoked or not. This was not the
> behavior in prior versions of Torque.
>
> Here is a sample program that shows the problem:
>
> -----
> #include <stdio.h>
> #include <stdlib.h>   /* for exit() */
> #include <tm.h>
>
> int main(int argc, char* argv[])
> {
>     int ret;
>     struct tm_roots tm_root;
>     tm_node_id *tm_node_ids = NULL;
>     int num_node_ids = 0;
>
>     ret = tm_init(NULL, &tm_root);
>     if (TM_SUCCESS != ret) {
>         printf("tm_init failed\n");
>         exit(0);
>     }
>
>     ret = tm_nodeinfo(&tm_node_ids, &num_node_ids);
>     if (TM_SUCCESS != ret) {
>         printf("tm_nodeinfo barfed: ret %d, ESYSTEM %d\n",
>                ret, TM_ESYSTEM);
>         tm_finalize();
>         exit(0);
>     }
>
>     tm_finalize();
>     exit(0);
> }
> -----
>
> Here's a compile and run that shows the problem:
>
> -----
> aon021:~/tmp jsquyres$ echo $PBS_ENVIRONMENT $PBS_JOBID
> PBS_INTERACTIVE 401.aon.[[[private]]]
> aon021:~/tmp jsquyres$ pbs-config --version
> 2.1.0p0
> aon021:~/tmp jsquyres$ gcc test_tm.c -o test_tm -I /home/software/torque-2.1.0p0/include -L/home/software/torque-2.1.0p0/lib -ltorque
> aon021:~/tmp jsquyres$ ./test_tm
> tm_nodeinfo barfed: ret 17000, ESYSTEM 17000
> aon021:~/tmp jsquyres$
> -----
>
> Is there a possibility that this is a local configuration problem? If
> it is a genuine bug in Torque, it hoses everyone who is using LAM/MPI
> and/or Open MPI (and possibly mpiexec; I don't know whether mpiexec
> uses tm_nodeinfo()).

Both mpiexec and pbsdsh use tm_nodeinfo(). If you can't do 'pbsdsh
hostname' you know there is a problem.

I tried your test program and it works fine for me. Nothing has changed
in that code in a long time. I suspect the TM code is fine, but MOM
thinks the job has 0 vnodes (which AFAIK is impossible.)

Can you send a 'qstat -f $jobid' and 'momctl -d 4 -h $node' output?
Both of those should be run as an admin user (root) from the pbs_server
to ensure all info is reported.
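
When reading the 'qstat -f' output, one rough cross-check is to count the
entries in the job's exec_host attribute and compare that with the nnodes
value tm_nodeinfo() hands back. The following is only a sketch, not Torque
code: count_exec_host_entries() is a hypothetical helper, and it assumes
exec_host is a '+'-separated list such as "node01/0+node01/1+node02/0".
Whether every entry maps to exactly one tm_node_id is also an assumption
here.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Torque): count the entries in an
 * exec_host-style string.
 * ASSUMPTION: entries are separated by '+', for example
 * "node01/0+node01/1+node02/0". Empty entries are skipped.
 * Returns 0 for a NULL or empty string.
 */
size_t count_exec_host_entries(const char *exec_host)
{
    size_t count = 0;
    const char *p = exec_host;

    if (p == NULL) {
        return 0;
    }

    while (*p != '\0') {
        /* Length of the current entry, up to the next '+' or the end. */
        size_t len = strcspn(p, "+");

        if (len > 0) {
            count++;
        }
        p += len;

        /* Step over the separator, if there is one. */
        if (*p == '+') {
            p++;
        }
    }

    return count;
}
```

If the count from exec_host is non-zero but the test program still gets
TM_ESYSTEM, that would point at MOM's view of the job rather than at the
job's allocation.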