Bugzilla – Bug 224
pbs_mom does not create login shells for regular batch jobs
Last modified: 2013-03-19 20:07:19 MDT
You need to log in before you can comment on or make changes to this bug.
When using torque 4.1.3, non-interactive jobs are run in a non-login shell. This prevents /etc/profile or the user's ~/.bashrc file from running at the beginning of a cluster job. The relevant function "source_login_shells_or_not" found in src/resmom/start_exec.c (line 3747 in torque-4.1.3), does not appear to run. I inserted debugging statements to try to see the function run, but there was no output. This function is called from TMomFinalizeChild. We are running torque-4.1.3 on a RHEL6.3 cluster with maui 3.3.1 Charles
(In reply to comment #0) > When using torque 4.1.3, non-interactive jobs are run in a non-login shell. > This prevents /etc/profile or the user's ~/.bashrc file from running at the > beginning of a cluster job. > > The relevant function "source_login_shells_or_not" found in > src/resmom/start_exec.c (line 3747 in torque-4.1.3), does not appear to run. I > inserted debugging statements to try to see the function run, but there was no > output. This function is called from TMomFinalizeChild. > > We are running torque-4.1.3 on a RHEL6.3 cluster with maui 3.3.1 > > Charles Update based on email on torque-users list: The problem is specific to non-interactive jobs with checkpoint enabled. I can reproduce the problem by using a queue with "checkpoint_defaults" set or by passing "-c enabled" to qsub. When checkpointing is not enabled, jobs run in login shells correctly. Charles
I think we've found the source of the problem, but it's a bit tricky. All source code references are based on 4.1.4: As it turns out, source_login_shells_or_not *is* being properly called, and it makes the correct decisions; the problem appears to be in resmom/checkpoint.c, in the routine mom_checkpoint_execute_job. At the beginning of mom_checkpoint_execute_job for one test job, we can see that said job enters the function with argv[0] set to "-bash"; this is evidence that source_login_shells_or_not correctly designated a login shell. HOWEVER, at line 205 of checkpoint.c, we have the following: arg[0] = checkpoint_run_exe_name; This removes the "-bash" from argv, and the shell no longer has the expected indication that it will be a login shell. Here is where I believe things get messy. Even if we modified line 225: execve(checkpoint_run_exe_name, arg, vtable->v_envp); to attempt to re-inject "-bash" into the exec, I think it is quite certain that cr_run (or any trivial wrapper thereof) will have no way to carry the leading hyphen through to the actual shell invocation itself. This is not a big deal if we instead add "--login" or "-l" (though this may not be portable to non-bash shells; I don't know yet) to another position in argv[], but since argv[] is manipulated in so many places throughout the source code I don't dare just add another slot and insert another argument without some advice. Should we be attempting to create a patch, waiting for dev feedback, or both?
It seems I misunderstood the scope of the fix on my last post; I don't think it's really that complicated after all. A copy of the array should suffice for the exec, with an additional "-l" argument inserted at position 1; the original array should not need to be tampered with. A nearly trivial patch could do this for, say, bash alone, but I still don't know whether attempting to detect specific shells is a desirable approach; I don't mess much with other shells. It looks like ksh and bash are happy if one prepends the argument list with "-l" for a login shell, but csh may require that no other options are present; I haven't found anything useful about any other shells, but it does appear that "-l" is less universal than the old hyphen trick. Does this change anything? Should we just be patching torque internally for our own preferred shells if we can't submit a universal patch that works for most/all common shells? Or would specific fixes for specific shells, with current behavior as a default to fall back on still be better than nothing?
The correct solution is to persist the hyphen throughout and insure that it is in place when the execution takes place. If cr_run fails to persist the hyphen, it too must be patched. But the hyphen has been the standard for years; there are, AFAIK, no equally-portable alternatives.
(In reply to comment #4) > The correct solution is to persist the hyphen throughout and insure that it is > in place when the execution takes place. If cr_run fails to persist the > hyphen, it too must be patched. But the hyphen has been the standard for > years; there are, AFAIK, no equally-portable alternatives. Thanks; that helps. Note, though, that line 205 of resmom/checkpoint.c *purges* the hyphen completely from argv *before* cr_run is ever executed, so cr_run will never *receive* it to pass it through. Even if this is fixed, cr_run may have no way of knowing how to pass it through exactly as intended (since the hyphen *replaces* the dirname in argv[0], meaning cr_run might have to guess at the full path for the shell), so I think it will be more of a redesign than a patch for cr_run. Should we be trying to collect any more information for the list, or are we bound for a WONTFIX anyway?
I didn't think Torque ever started the users shell as a login shell, we've always had to add: BASH_ENV=/etc/profile.d/modules.sh to /var/spool/torque/pbs_environment to force modules to get loaded for bash users (which is generally all they need, everything else is specified via "module load $FOO" commands in their PBS script).
(In reply to comment #5) > (In reply to comment #4) > > The correct solution is to persist the hyphen throughout and insure that it is > > in place when the execution takes place. If cr_run fails to persist the > > hyphen, it too must be patched. But the hyphen has been the standard for > > years; there are, AFAIK, no equally-portable alternatives. > > Thanks; that helps. Note, though, that line 205 of resmom/checkpoint.c > *purges* the hyphen completely from argv *before* cr_run is ever executed, so > cr_run will never *receive* it to pass it through. Even if this is fixed, > cr_run may have no way of knowing how to pass it through exactly as intended > (since the hyphen *replaces* the dirname in argv[0], meaning cr_run might have > to guess at the full path for the shell), so I think it will be more of a > redesign than a patch for cr_run. > > Should we be trying to collect any more information for the list, or are we > bound for a WONTFIX anyway? Thanks for the advice, Michael. In addition to being much more brusque and preachy than I meant to be (my apologies for that), I was flat wrong about how cr_run works. It may indeed be a simple patch to effect the behavior you've described in cr_run; hopefully that in turn will render the corresponding fixes to Torque much simpler and clear-cut. I'll post back when I can propose both a correction to the aforementioned argv[0] bug and a sample cr_run analogue that can take advantage of the fix.
(In reply to comment #7) > Thanks for the advice, Michael. In addition to being much more brusque and > preachy than I meant to be (my apologies for that), No worries; we've all been there. :-) > I was flat wrong about how > cr_run works. It may indeed be a simple patch to effect the behavior you've > described in cr_run; hopefully that in turn will render the corresponding fixes > to Torque much simpler and clear-cut. I'll post back when I can propose both a > correction to the aforementioned argv[0] bug and a sample cr_run analogue that > can take advantage of the fix. I was hoping that would be the case. Definitely let us know what you come up with! :-)
All right, so here is a basic patch, relatively untested and with its problems; this should NOT be considered final yet: --- checkpoint.c.orig 2013-03-18 18:20:13.000000000 -0500 +++ checkpoint.c 2013-03-18 18:20:41.000000000 -0500 @@ -201,9 +201,13 @@ return(-1); } + char* final_arg0 = arg[0]; // Keep this for later strcpy(arg[1], shell); arg[0] = checkpoint_run_exe_name; + char* newarg[9] = { arg[0], "-a", final_arg0, arg[1], + arg[2], arg[3], arg[4], arg[5], NULL }; + if (LOGLEVEL >= 10) { char cmd[MAXPATHLEN + 1]; @@ -222,7 +226,7 @@ log_ext(-1, __func__, log_buffer, LOG_DEBUG); } - execve(checkpoint_run_exe_name, arg, vtable->v_envp); + execve(checkpoint_run_exe_name, newarg, vtable->v_envp); return (0); } This does bear some explanation, though fortunately the patch is surprisingly trivial. First, we copy the argv[0] which we are about to overwrite. Then, we assemble this, along with the rest of our argument vector and a secret cr_run flag (more on that soon) into a new argument vector for cr_run. Very simple, but there are some assumptions and problems: PROBLEM: Log output does not know about this change, and still prints the pre-patch command. I just need to know that adding "-a <old argv[0]>" won't cause memory access violations, since I'm not entirely certain we won't go past MAXPATHLEN+1. Please advise if you are familiar with any guidelines about this; if nothing else I will just create a new string with length MAXPATHLEN+strlen("-a")+strlen(final_arg0)+3 and print that (if appropriate for the current checkpoint type). ASSUMPTION: We are using cr_run as our checkpoint executable. Not a safe assumption, but the reasoning first: as was explained to me on the blcr mailing list, cr_run ends with an 'exec "$@"', which exposes the bash exec and its (since 2.0 in 1996 by my count) "-a ARGV0" and "-l" arguments to a caller who expects them. I think I will ask about this practice on the blcr mailing list; my original plan had been to expose these arguments via new arguments to cr_run, but as this patch works without these modifications, I am starting with a minimal version. I chose "-a" over "-l" because this way the logic already implemented in source_login_shells_or_not(...) is used without modification. PROBLEM: This assumption about cr_run may break other checkpoint executables. I considered adding a strcmp(...) to check for cr_run, but as this is unlikely to solve the problem I abstained. I suspect checking checkpoint_system_type against CST_BLCR would be the best route. Again, advice from those in the know would be appreciated. In short, I believe this patch will solve the problem on blcr systems and (in its current state) compound the problem on non-blcr checkpointing systems. Once this fix is applied *only* if the current checkpoint type is blcr, it should solve the problem on blcr systems and do nothing to affect non-blcr systems. Comments, anyone? I will post back after I've had a chance to test a checkpoint_system_type conditional; since that makes the patch messier, though, I wanted to clear the oversimplified version first. Thanks for all the ongoing advice and assistance. Phil
This is a clumsier patch, but it should deal with the checkpoint type issues: diff -u checkpoint.c.orig checkpoint.c --- checkpoint.c.orig 2013-03-18 18:20:13.000000000 -0500 +++ checkpoint.c 2013-03-18 19:22:46.000000000 -0500 @@ -201,19 +201,50 @@ return(-1); } + char* final_arg0 = arg[0]; // Keep this for later strcpy(arg[1], shell); arg[0] = checkpoint_run_exe_name; + char** newarg; + if (mom_does_checkpoint() == CST_BLCR) + { + newarg = calloc(sizeof(char*),9); + newarg[0] = arg[0]; + newarg[1] = "-a"; + newarg[2] = final_arg0; + newarg[3] = arg[1]; + newarg[4] = arg[2]; + newarg[5] = arg[3]; + newarg[6] = arg[4]; + newarg[7] = arg[5]; + newarg[8] = NULL; + } + else + { + newarg = arg; + } + if (LOGLEVEL >= 10) { - char cmd[MAXPATHLEN + 1]; + int cmdlength; + char* cmd; + if (mom_does_checkpoint() == CST_BLCR) + { + cmdlength = MAXPATHLEN + 1; + } + else + { + // Two more spaces, a "-a", and our new argv[0] + cmdlength = MAXPATHLEN + strlen(final_arg0) + 5; + } + cmd = calloc(1, cmdlength); int i; - strcpy(cmd,arg[0]); - for (i = 1; arg[i] != NULL; i++) + strcpy(cmd,newarg[0]); + for (i = 1; newarg[i] != NULL; i++) { strcat(cmd," "); - strcat(cmd,arg[i]); + strcat(cmd,newarg[i]); } strcat(cmd,")"); @@ -222,7 +253,7 @@ log_ext(-1, __func__, log_buffer, LOG_DEBUG); } - execve(checkpoint_run_exe_name, arg, vtable->v_envp); + execve(checkpoint_run_exe_name, newarg, vtable->v_envp); return (0); }
Sorry for all the bugzilla spam. The latest patch (Comment #10) is only safe for distributions which call bash as /bin/sh; as was explained to me on the blcr mailing list, dash and others may not handle this "-a" invocation as intended, which may cause this patch to break checkpointing on Debian-based installations unless the first line of cr_run is changed from #!/bin/sh to #!/bin/bash So far, though, testing here is positive; I now get login shells and full user environment submitting checkpointable jobs. I'll post back if I can get an authoritative recommendation from the blcr group; if not, can we request this patch be approved for inclusion in contrib/ along with a safe cr_run? I can write some basic documentation for usage if that would help.
Recent discussion on the blcr list seems to lean heavily toward implicitly preserving the "-a" and "-l" options to cr_run by officially switching from sh to bash in a future revision and documenting correct use of these flags; since my previous patch is not safe for all distributions, though, what is the preferred option? If I rebase the patch properly, can we include it for testing? Should we hold off until some future release of blcr supports these options, and then add a build option to include it if the presently installed version of blcr meets that minimum version number? At any rate I think we (KU/ITTC) have solved the problem as we experience it here, so we don't necessarily need this included officially, but I would like to think that now that we know the source of the problem and at least one possible fix, with parallel momentum from the blcr list, that we are at least approaching a general solution that works for other torque sites too. Is there anything else I can do to further that end?
I realize it might be a little late to be saying this, but this may need to be moved over to GitHub in order to receive sufficient attention from the entire development team. Also, if you can supply a branch that has your proposed changes on it, you're more likely to get other sites to try it and report back. In theory, anyway.... :-)