Bug 149 - pbs_mom thread to purge jobs crashes with segfault
: pbs_mom thread to purge jobs crashes with segfault
Status: RESOLVED FIXED
Product: TORQUE
pbs_mom
: 2.4.x
: Other Linux
: P5 major
Assigned To: Ken Nielson
:
:
:
  Show dependency treegraph
 
Reported: 2011-07-14 17:43 MDT by Chris Samuel
Modified: 2011-10-13 12:14 MDT (History)
3 users (show)

See Also:


Attachments
Revert to older job_purge, abstract tmpdir_purge() and run remtree in a forked process (4.58 KB, patch)
2011-07-14 19:21 MDT, Chris Samuel
Details | Diff
Revert to older job_purge, abstract tmpdir_purge() and run remtree in a forked process (4.58 KB, patch)
2011-07-14 19:55 MDT, Chris Samuel
Details | Diff
Revert to older job_purge, abstract tmpdir_purge() and run remtree in a forked process (4.61 KB, patch)
2011-07-14 19:57 MDT, Chris Samuel
Details | Diff
This patch adds correction for parent process after fork in the previous patch (4.73 KB, patch)
2011-07-20 15:00 MDT, Ken Nielson
Details | Diff


Note

You need to log in before you can comment on or make changes to this bug.


Description Chris Samuel 2011-07-14 17:43:15 MDT
We are seeing segfaults on 2.4.15 with the new code to run purges in threads.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x40602940 (LWP 26231)]
0x000000339a47288e in free () from /lib64/libc.so.6
(gdb) bt full
#0  0x000000339a47288e in free () from /lib64/libc.so.6
No symbol table info available.
#1  0x000000000042a84d in job_free (pj=0x780220) at job_func.c:534
        __PRETTY_FUNCTION__ = "job_free"
#2  0x000000000042aaba in job_purge_thread (jobtopurge=<value optimized out>)
    at job_func.c:775
        pjob = 0x780220
        namebuf =
"/var/spool/torque/mom_priv/jobs/822623-313.bruce-m.vlsci.unimelb.edu.au.JB",
'\000' <repeats 2810 times>, "\001\000\000\000\020\353\300\232\063", '\000'
<repeats 27 times>, "\002", '\000' <repeats 39 times>,
"\002\002\000\000\000\000\000\000\000\000
\000\000\000\000\000\003\000\000\000\000\000\000\000\000V\020\254\252*\000\000\000\020\000\000\000\000\000\000\030-`@\000\000\000\000\200\000\000\000\000\000\000\000@)`@",
'\000' <repeats 12 times>, "\002", '\000' <repeats 15 times>"\377,
\377\377\377\377\377\377\377\250
`@\000\000\000\000\216\324\300\232\063\000\000\000"...
        rc = <value optimized out>
        id = "job_purge_thread"
#3  0x000000339ac0673d in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4  0x000000339a4d44bd in clone () from /lib64/libc.so.6
No symbol table info available.

We are seeing it randomly during normal production, but predictably if pbs_mom
is started and there are multiple jobs waiting to be purged as they've finished
since pbs_mom crashed (as shown by stale job cpusets in /dev/cpuset/torque).

In the initial request for this feature I'd asked if these could be done in
fork()'d processes and I'm wondering if this new code is thread safe..
Comment 1 Chris Samuel 2011-07-14 18:08:36 MDT
I would suggest that only the $TMPDIR removal is done in a thread, not the rest
of the cleaning up of the job..
Comment 2 Chris Samuel 2011-07-14 19:21:12 MDT
Created an attachment (id=81) [details]
Revert to older job_purge, abstract tmpdir_purge() and run remtree in a forked
process

This patch reverts the code in src/resmom/job_func.c to the pre-pthreads
version, abstracts out the code that surrounds the remtree() in job_purge()
into tmpdir_purge() (only called if there is a $TMPDIR for the job) and runs
that within a fork()'d process.

Very basic testing here seems to show it's OK and have had to put it into
production due to crashes.
Comment 3 Chris Samuel 2011-07-14 19:55:22 MDT
Created an attachment (id=82) [details]
Revert to older job_purge, abstract tmpdir_purge() and run remtree in a forked
process

New version of the patch, forgot it needed to exit() rather than return() in
the child process! :-(
Comment 4 Chris Samuel 2011-07-14 19:57:30 MDT
Created an attachment (id=83) [details]
Revert to older job_purge, abstract tmpdir_purge() and run remtree in a forked
process

Apologies - attached same patch last time, this should be a more correct
version.
Comment 5 Chris Samuel 2011-07-14 23:34:15 MDT
NB: The original AC job_purge_thread() code swaps the processes euid and egid
to the user, removes the $TMPDIR and then swaps them back to the pbs user (aka
root) - this isn't thread safe as it'll affect all execution threads at the
same time.   We were seeing errors about not being able to remove .TK files for
this reason.
Comment 6 Ken Nielson 2011-07-20 15:00:47 MDT
Created an attachment (id=84) [details]
This patch adds correction for parent process after fork in the previous patch

This patch fixes a bug in tmpdir_purge after the fork. The original code
returned to the calling function as the child

if(0 == process_id)
+  {
+     /* We are the parent, so return to carry on with the rest */
+     return;
+  }

The new code fixes this oversight.

+  if(0 != process_id)
+  {
+     /* We are the parent, so return to carry on with the rest */
+     return;
+  }
Comment 7 Ken Nielson 2011-07-20 15:01:56 MDT
this patch has been submitted to TORQUE 2.4.16.
Comment 8 Chris Samuel 2011-07-24 18:25:50 MDT
Somehow part of the context from the unified diff I submitted as the patch has
made it into the 2.4-fixes tree, so I'm guessing it wasn't applied with patch ?

The code:

  /* Only remove job directory on mother superior - CS @ VLSCI 2011/07/12 */
  if ((pjob->ji_qs.ji_svrflags & JOB_SVFLG_HERE) && (pjob->ji_flags &
MOM_HAS_TMPDIR))

which is part of a different local patch has got into your SVN repo and needs
to be reverted to the original code:

  if (pjob->ji_flags & MOM_HAS_TMPDIR)
Comment 9 Ken Nielson 2011-07-25 12:54:30 MDT
I went back to job_func.c before we tried the original patch. Similar to what
Chris has done. that code works as far as not hanging the MOM. However, as soon
as I do a fork in job_purge the MOM will do only one job and then the server
counts the MOM node down. Same behavior we are getting with Chris's patch.

I am still working on it but I am asking to see if anyone out there may have a
suggestions as to why this is failing.
Comment 10 Chris Samuel 2011-07-25 17:07:15 MDT
Could it be that the pbs_mom is not dealing gracefully with the SIGCHLD
received when that forked process exit()s ?
Comment 11 Chris Samuel 2011-07-26 23:40:58 MDT
Version in 2.4.16-snap.201107261642 is looking a lot better, no crashes or
pbs_mom drop outs so far after 100 test jobs, thanks!
Comment 12 Ken Nielson 2011-07-27 13:13:51 MDT
Fixed in 2.4.16
Comment 13 Glen 2011-10-13 12:14:18 MDT
I only see mention of TORQUE 2.4 in this bug, however I am seeing occasional
instances where pbs_mom assumes the UID of a regular user after cleaning up a
temporary directory.  This does not occur in every case, but certain types of
jobs seem to exhibit it more than others (maybe it is more likely with larger
amounts of data in the tmpdir?).  Could this be the same bug, or is it a
different race condition? We can go weeks without it happening, but yesterday
we saw it happen 2 or 3 times.