Bug 224 - pbs_mom does not create login shells for regular batch jobs
: pbs_mom does not create login shells for regular batch jobs
Status: NEW
Product: TORQUE
pbs_mom
: 4.1.*
: PC Linux
: P5 normal
Assigned To: Ken Nielson
:
:
:
  Show dependency treegraph
 
Reported: 2012-11-27 15:37 MST by Charles Henry
Modified: 2013-03-19 20:07 MDT (History)
4 users (show)

See Also:


Attachments


Note

You need to log in before you can comment on or make changes to this bug.


Description Charles Henry 2012-11-27 15:37:49 MST
When using torque 4.1.3, non-interactive jobs are run in a non-login shell. 
This prevents /etc/profile or the user's ~/.bashrc file from running at the
beginning of a cluster job.

The relevant function "source_login_shells_or_not" found in
src/resmom/start_exec.c (line 3747 in torque-4.1.3), does not appear to run.  I
inserted debugging statements to try to see the function run, but there was no
output.  This function is called from TMomFinalizeChild.

We are running torque-4.1.3 on a RHEL6.3 cluster with maui 3.3.1

Charles
Comment 1 Charles Henry 2012-12-03 16:20:04 MST
(In reply to comment #0)
> When using torque 4.1.3, non-interactive jobs are run in a non-login shell. 
> This prevents /etc/profile or the user's ~/.bashrc file from running at the
> beginning of a cluster job.
> 
> The relevant function "source_login_shells_or_not" found in
> src/resmom/start_exec.c (line 3747 in torque-4.1.3), does not appear to run.  I
> inserted debugging statements to try to see the function run, but there was no
> output.  This function is called from TMomFinalizeChild.
> 
> We are running torque-4.1.3 on a RHEL6.3 cluster with maui 3.3.1
> 
> Charles

Update based on email on torque-users list:

The problem is specific to non-interactive jobs with checkpoint enabled.  

I can reproduce the problem by using a queue with "checkpoint_defaults" set or
by passing "-c enabled" to qsub.

When checkpointing is not enabled, jobs run in login shells correctly.

Charles
Comment 2 Phil Regier 2013-03-01 16:35:12 MST
I think we've found the source of the problem, but it's a bit tricky.  All
source code references are based on 4.1.4:

As it turns out, source_login_shells_or_not *is* being properly called, and it
makes the correct decisions; the problem appears to be in resmom/checkpoint.c,
in the routine mom_checkpoint_execute_job.

At the beginning of mom_checkpoint_execute_job for one test job, we can see
that said job enters the function with argv[0] set to "-bash"; this is evidence
that source_login_shells_or_not correctly designated a login shell.  HOWEVER,
at line 205 of checkpoint.c, we have the following:

  arg[0] = checkpoint_run_exe_name;

This removes the "-bash" from argv, and the shell no longer has the expected
indication that it will be a login shell.

Here is where I believe things get messy.  Even if we modified line 225:

  execve(checkpoint_run_exe_name, arg, vtable->v_envp);

to attempt to re-inject "-bash" into the exec, I think it is quite certain that
cr_run (or any trivial wrapper thereof) will have no way to carry the leading
hyphen through to the actual shell invocation itself.  This is not a big deal
if we instead add "--login" or "-l" (though this may not be portable to
non-bash shells; I don't know yet) to another position in argv[], but since
argv[] is manipulated in so many places throughout the source code I don't dare
just add another slot and insert another argument without some advice.

Should we be attempting to create a patch, waiting for dev feedback, or both?
Comment 3 Phil Regier 2013-03-03 23:05:44 MST
It seems I misunderstood the scope of the fix on my last post; I don't think
it's really that complicated after all.  A copy of the array should suffice for
the exec, with an additional "-l" argument inserted at position 1; the original
array should not need to be tampered with.

A nearly trivial patch could do this for, say, bash alone, but I still don't
know whether attempting to detect specific shells is a desirable approach; I
don't mess much with other shells.  It looks like ksh and bash are happy if one
prepends the argument list with "-l" for a login shell, but csh may require
that no other options are present; I haven't found anything useful about any
other shells, but it does appear that "-l" is less universal than the old
hyphen trick.

Does this change anything?  Should we just be patching torque internally for
our own preferred shells if we can't submit a universal patch that works for
most/all common shells?  Or would specific fixes for specific shells, with
current behavior as a default to fall back on still be better than nothing?
Comment 4 Michael Jennings 2013-03-04 10:26:46 MST
The correct solution is to persist the hyphen throughout and insure that it is
in place when the execution takes place.  If cr_run fails to persist the
hyphen, it too must be patched.  But the hyphen has been the standard for
years; there are, AFAIK, no equally-portable alternatives.
Comment 5 Phil Regier 2013-03-04 12:29:32 MST
(In reply to comment #4)
> The correct solution is to persist the hyphen throughout and insure that it is
> in place when the execution takes place.  If cr_run fails to persist the
> hyphen, it too must be patched.  But the hyphen has been the standard for
> years; there are, AFAIK, no equally-portable alternatives.

Thanks; that helps.  Note, though, that line 205 of resmom/checkpoint.c
*purges* the hyphen completely from argv *before* cr_run is ever executed, so
cr_run will never *receive* it to pass it through.  Even if this is fixed,
cr_run may have no way of knowing how to pass it through exactly as intended
(since the hyphen *replaces* the dirname in argv[0], meaning cr_run might have
to guess at the full path for the shell), so I think it will be more of a
redesign than a patch for cr_run.

Should we be trying to collect any more information for the list, or are we
bound for a WONTFIX anyway?
Comment 6 Chris Samuel 2013-03-04 18:56:22 MST
I didn't think Torque ever started the users shell as a login shell, we've
always had to add:

BASH_ENV=/etc/profile.d/modules.sh

to /var/spool/torque/pbs_environment to force modules to get loaded for bash
users (which is generally all they need, everything else is specified via
"module load $FOO" commands in their PBS script).
Comment 7 Phil Regier 2013-03-06 17:42:29 MST
(In reply to comment #5)
> (In reply to comment #4)
> > The correct solution is to persist the hyphen throughout and insure that it is
> > in place when the execution takes place.  If cr_run fails to persist the
> > hyphen, it too must be patched.  But the hyphen has been the standard for
> > years; there are, AFAIK, no equally-portable alternatives.
> 
> Thanks; that helps.  Note, though, that line 205 of resmom/checkpoint.c
> *purges* the hyphen completely from argv *before* cr_run is ever executed, so
> cr_run will never *receive* it to pass it through.  Even if this is fixed,
> cr_run may have no way of knowing how to pass it through exactly as intended
> (since the hyphen *replaces* the dirname in argv[0], meaning cr_run might have
> to guess at the full path for the shell), so I think it will be more of a
> redesign than a patch for cr_run.
> 
> Should we be trying to collect any more information for the list, or are we
> bound for a WONTFIX anyway?

Thanks for the advice, Michael.  In addition to being much more brusque and
preachy than I meant to be (my apologies for that), I was flat wrong about how
cr_run works.  It may indeed be a simple patch to effect the behavior you've
described in cr_run; hopefully that in turn will render the corresponding fixes
to Torque much simpler and clear-cut.  I'll post back when I can propose both a
correction to the aforementioned argv[0] bug and a sample cr_run analogue that
can take advantage of the fix.
Comment 8 Michael Jennings 2013-03-12 18:34:28 MDT
(In reply to comment #7)
> Thanks for the advice, Michael.  In addition to being much more brusque and
> preachy than I meant to be (my apologies for that),

No worries; we've all been there.  :-)

> I was flat wrong about how
> cr_run works.  It may indeed be a simple patch to effect the behavior you've
> described in cr_run; hopefully that in turn will render the corresponding fixes
> to Torque much simpler and clear-cut.  I'll post back when I can propose both a
> correction to the aforementioned argv[0] bug and a sample cr_run analogue that
> can take advantage of the fix.

I was hoping that would be the case.  Definitely let us know what you come up
with!  :-)
Comment 9 Phil Regier 2013-03-18 18:35:55 MDT
All right, so here is a basic patch, relatively untested and with its problems;
this should NOT be considered final yet:

--- checkpoint.c.orig    2013-03-18 18:20:13.000000000 -0500
+++ checkpoint.c    2013-03-18 18:20:41.000000000 -0500
@@ -201,9 +201,13 @@
     return(-1);
     }

+  char* final_arg0 = arg[0];  // Keep this for later
   strcpy(arg[1], shell);
   arg[0] = checkpoint_run_exe_name;

+  char* newarg[9] = { arg[0], "-a", final_arg0, arg[1],
+              arg[2], arg[3], arg[4], arg[5], NULL };
+
   if (LOGLEVEL >= 10)
     {
     char cmd[MAXPATHLEN + 1];
@@ -222,7 +226,7 @@
     log_ext(-1, __func__, log_buffer, LOG_DEBUG);
     }

-  execve(checkpoint_run_exe_name, arg, vtable->v_envp);
+  execve(checkpoint_run_exe_name, newarg, vtable->v_envp);

   return (0);
   }


This does bear some explanation, though fortunately the patch is surprisingly
trivial.  First, we copy the argv[0] which we are about to overwrite.  Then, we
assemble this, along with the rest of our argument vector and a secret cr_run
flag (more on that soon) into a new argument vector for cr_run.  Very simple,
but there are some assumptions and problems:


PROBLEM:  Log output does not know about this change, and still prints the
pre-patch command.  I just need to know that adding "-a <old argv[0]>" won't
cause memory access violations, since I'm not entirely certain we won't go past
MAXPATHLEN+1.  Please advise if you are familiar with any guidelines about
this; if nothing else I will just create a new string with length
MAXPATHLEN+strlen("-a")+strlen(final_arg0)+3 and print that (if appropriate for
the current checkpoint type).

ASSUMPTION:  We are using cr_run as our checkpoint executable.  Not a safe
assumption, but the reasoning first:  as was explained to me on the blcr
mailing list, cr_run ends with an 'exec "$@"', which exposes the bash exec and
its (since 2.0 in 1996 by my count) "-a ARGV0" and "-l" arguments to a caller
who expects them.  I think I will ask about this practice on the blcr mailing
list; my original plan had been to expose these arguments via new arguments to
cr_run, but as this patch works without these modifications, I am starting with
a minimal version.  I chose "-a" over "-l" because this way the logic already
implemented in source_login_shells_or_not(...) is used without modification.

PROBLEM:  This assumption about cr_run may break other checkpoint executables. 
I considered adding a strcmp(...) to check for cr_run, but as this is unlikely
to solve the problem I abstained.  I suspect checking checkpoint_system_type
against CST_BLCR would be the best route.  Again, advice from those in the know
would be appreciated.


In short, I believe this patch will solve the problem on blcr systems and (in
its current state) compound the problem on non-blcr checkpointing systems. 
Once this fix is applied *only* if the current checkpoint type is blcr, it
should solve the problem on blcr systems and do nothing to affect non-blcr
systems.

Comments, anyone?  I will post back after I've had a chance to test a
checkpoint_system_type conditional; since that makes the patch messier, though,
I wanted to clear the oversimplified version first.  Thanks for all the ongoing
advice and assistance.

Phil
Comment 10 Phil Regier 2013-03-19 12:18:32 MDT
This is a clumsier patch, but it should deal with the checkpoint type issues:

diff -u checkpoint.c.orig  checkpoint.c
--- checkpoint.c.orig    2013-03-18 18:20:13.000000000 -0500
+++ checkpoint.c    2013-03-18 19:22:46.000000000 -0500
@@ -201,19 +201,50 @@
     return(-1);
     }

+  char* final_arg0 = arg[0];  // Keep this for later
   strcpy(arg[1], shell);
   arg[0] = checkpoint_run_exe_name;

+  char** newarg;
+  if (mom_does_checkpoint() == CST_BLCR)
+    {
+    newarg = calloc(sizeof(char*),9);
+    newarg[0] = arg[0];
+    newarg[1] = "-a";
+    newarg[2] = final_arg0;
+    newarg[3] = arg[1];
+    newarg[4] = arg[2];
+    newarg[5] = arg[3];
+    newarg[6] = arg[4];
+    newarg[7] = arg[5];
+    newarg[8] = NULL;
+    }
+  else
+    {
+    newarg = arg;
+    }
+
   if (LOGLEVEL >= 10)
     {
-    char cmd[MAXPATHLEN + 1];
+    int cmdlength;
+    char* cmd;
+    if (mom_does_checkpoint() == CST_BLCR)
+      {
+      cmdlength = MAXPATHLEN + 1;
+      }
+    else
+      {
+      // Two more spaces, a "-a", and our new argv[0]
+      cmdlength = MAXPATHLEN + strlen(final_arg0) + 5;
+      }
+    cmd = calloc(1, cmdlength);
     int i;

-    strcpy(cmd,arg[0]);
-    for (i = 1; arg[i] != NULL; i++)
+    strcpy(cmd,newarg[0]);
+    for (i = 1; newarg[i] != NULL; i++)
       {
       strcat(cmd," ");
-      strcat(cmd,arg[i]);
+      strcat(cmd,newarg[i]);
       }
     strcat(cmd,")");

@@ -222,7 +253,7 @@
     log_ext(-1, __func__, log_buffer, LOG_DEBUG);
     }

-  execve(checkpoint_run_exe_name, arg, vtable->v_envp);
+  execve(checkpoint_run_exe_name, newarg, vtable->v_envp);

   return (0);
   }
Comment 11 Phil Regier 2013-03-19 13:39:04 MDT
Sorry for all the bugzilla spam.  The latest patch (Comment #10) is only safe
for distributions which call bash as /bin/sh; as was explained to me on the
blcr mailing list, dash and others may not handle this "-a" invocation as
intended, which may cause this patch to break checkpointing on Debian-based
installations unless the first line of cr_run is changed from

#!/bin/sh

to

#!/bin/bash

So far, though, testing here is positive; I now get login shells and full user
environment submitting checkpointable jobs.  I'll post back if I can get an
authoritative recommendation from the blcr group; if not, can we request this
patch be approved for inclusion in contrib/ along with a safe cr_run?  I can
write some basic documentation for usage if that would help.
Comment 12 Phil Regier 2013-03-19 15:29:42 MDT
Recent discussion on the blcr list seems to lean heavily toward implicitly
preserving the "-a" and "-l" options to cr_run by officially switching from sh
to bash in a future revision and documenting correct use of these flags; since
my previous patch is not safe for all distributions, though, what is the
preferred option?  If I rebase the patch properly, can we include it for
testing?  Should we hold off until some future release of blcr supports these
options, and then add a build option to include it if the presently installed
version of blcr meets that minimum version number?

At any rate I think we (KU/ITTC) have solved the problem as we experience it
here, so we don't necessarily need this included officially, but I would like
to think that now that we know the source of the problem and at least one
possible fix, with parallel momentum from the blcr list, that we are at least
approaching a general solution that works for other torque sites too.  Is there
anything else I can do to further that end?
Comment 13 Michael Jennings 2013-03-19 20:07:19 MDT
I realize it might be a little late to be saying this, but this may need to be
moved over to GitHub in order to receive sufficient attention from the entire
development team.

Also, if you can supply a branch that has your proposed changes on it, you're
more likely to get other sites to try it and report back.  In theory,
anyway....  :-)