Bugzilla – Bug 149
pbs_mom thread to purge jobs crashes with segfault
Last modified: 2011-10-13 12:14:18 MDT
You need to log in before you can comment on or make changes to this bug.
We are seeing segfaults on 2.4.15 with the new code to run purges in threads. Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x40602940 (LWP 26231)] 0x000000339a47288e in free () from /lib64/libc.so.6 (gdb) bt full #0 0x000000339a47288e in free () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000042a84d in job_free (pj=0x780220) at job_func.c:534 __PRETTY_FUNCTION__ = "job_free" #2 0x000000000042aaba in job_purge_thread (jobtopurge=<value optimized out>) at job_func.c:775 pjob = 0x780220 namebuf = "/var/spool/torque/mom_priv/jobs/822623-313.bruce-m.vlsci.unimelb.edu.au.JB", '\000' <repeats 2810 times>, "\001\000\000\000\020\353\300\232\063", '\000' <repeats 27 times>, "\002", '\000' <repeats 39 times>, "\002\002\000\000\000\000\000\000\000\000 \000\000\000\000\000\003\000\000\000\000\000\000\000\000V\020\254\252*\000\000\000\020\000\000\000\000\000\000\030-`@\000\000\000\000\200\000\000\000\000\000\000\000@)`@", '\000' <repeats 12 times>, "\002", '\000' <repeats 15 times>"\377, \377\377\377\377\377\377\377\250 `@\000\000\000\000\216\324\300\232\063\000\000\000"... rc = <value optimized out> id = "job_purge_thread" #3 0x000000339ac0673d in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #4 0x000000339a4d44bd in clone () from /lib64/libc.so.6 No symbol table info available. We are seeing it randomly during normal production, but predictably if pbs_mom is started and there are multiple jobs waiting to be purged as they've finished since pbs_mom crashed (as shown by stale job cpusets in /dev/cpuset/torque). In the initial request for this feature I'd asked if these could be done in fork()'d processes and I'm wondering if this new code is thread safe..
I would suggest that only the $TMPDIR removal is done in a thread, not the rest of the cleaning up of the job..
Created an attachment (id=81) [details] Revert to older job_purge, abstract tmpdir_purge() and run remtree in a forked process This patch reverts the code in src/resmom/job_func.c to the pre-pthreads version, abstracts out the code that surrounds the remtree() in job_purge() into tmpdir_purge() (only called if there is a $TMPDIR for the job) and runs that within a fork()'d process. Very basic testing here seems to show it's OK and have had to put it into production due to crashes.
Created an attachment (id=82) [details] Revert to older job_purge, abstract tmpdir_purge() and run remtree in a forked process New version of the patch, forgot it needed to exit() rather than return() in the child process! :-(
Created an attachment (id=83) [details] Revert to older job_purge, abstract tmpdir_purge() and run remtree in a forked process Apologies - attached same patch last time, this should be a more correct version.
NB: The original AC job_purge_thread() code swaps the processes euid and egid to the user, removes the $TMPDIR and then swaps them back to the pbs user (aka root) - this isn't thread safe as it'll affect all execution threads at the same time. We were seeing errors about not being able to remove .TK files for this reason.
Created an attachment (id=84) [details] This patch adds correction for parent process after fork in the previous patch This patch fixes a bug in tmpdir_purge after the fork. The original code returned to the calling function as the child if(0 == process_id) + { + /* We are the parent, so return to carry on with the rest */ + return; + } The new code fixes this oversight. + if(0 != process_id) + { + /* We are the parent, so return to carry on with the rest */ + return; + }
this patch has been submitted to TORQUE 2.4.16.
Somehow part of the context from the unified diff I submitted as the patch has made it into the 2.4-fixes tree, so I'm guessing it wasn't applied with patch ? The code: /* Only remove job directory on mother superior - CS @ VLSCI 2011/07/12 */ if ((pjob->ji_qs.ji_svrflags & JOB_SVFLG_HERE) && (pjob->ji_flags & MOM_HAS_TMPDIR)) which is part of a different local patch has got into your SVN repo and needs to be reverted to the original code: if (pjob->ji_flags & MOM_HAS_TMPDIR)
I went back to job_func.c before we tried the original patch. Similar to what Chris has done. that code works as far as not hanging the MOM. However, as soon as I do a fork in job_purge the MOM will do only one job and then the server counts the MOM node down. Same behavior we are getting with Chris's patch. I am still working on it but I am asking to see if anyone out there may have a suggestions as to why this is failing.
Could it be that the pbs_mom is not dealing gracefully with the SIGCHLD received when that forked process exit()s ?
Version in 2.4.16-snap.201107261642 is looking a lot better, no crashes or pbs_mom drop outs so far after 100 test jobs, thanks!
Fixed in 2.4.16
I only see mention of TORQUE 2.4 in this bug, however I am seeing occasional instances where pbs_mom assumes the UID of a regular user after cleaning up a temporary directory. This does not occur in every case, but certain types of jobs seem to exhibit it more than others (maybe it is more likely with larger amounts of data in the tmpdir?). Could this be the same bug, or is it a different race condition? We can go weeks without it happening, but yesterday we saw it happen 2 or 3 times.