Bugzilla – Bug 174
pbs_mom kills running jobs despite -p flag
Last modified: 2012-03-08 09:49:36 MST
I want to restart a pbs_mom on a node where it has died for whatever reason, without killing the jobs that are still running on the node. We used to be able to do this by starting pbs_mom with the -p flag, but apparently this no longer works: every time I start the mom using "pbs_mom -p", all running jobs get killed. My feeling is that -p stopped working when we started to use cpusets (I am not absolutely sure about this, since we have also upgraded torque versions since then). We are currently running torque-2.5.10.
When I replace the line 214 in cpuset.c

if (cpuset_delete(pdirent->d_name) == 0)

with "if (0)" then the jobs do not get killed when I restart pbs_mom.
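Note that "if (0)" skips the cpuset_delete() call at that point unconditionally, so it is only a blunt workaround. A narrower guard would skip deletion only when the mom is asked to preserve jobs and the cpuset still contains tasks. The following is a minimal sketch of that idea, not TORQUE code and not the change that was eventually committed; the function name may_delete_cpuset and both of its parameters are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of TORQUE): decide whether a job cpuset
 * found at startup may be removed.
 *
 * preserve_running_jobs: assumed to be true when the mom was started
 *                        with -p.
 * live_tasks:            assumed to be the number of PIDs still listed
 *                        in the cpuset's tasks file.
 */
bool may_delete_cpuset(bool preserve_running_jobs, size_t live_tasks)
{
    /* Keep any cpuset that still holds processes of a job we want to keep. */
    if (preserve_running_jobs && live_tasks > 0)
        return false;

    /* Otherwise the cpuset is either empty or we were not asked to preserve it. */
    return true;
}
```

With such a check, line 214 would only be reached for cpusets where may_delete_cpuset() returns true, instead of never.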
(In reply to comment #1)
> When I replace the line 214 in cpuset.c
>
> if (cpuset_delete(pdirent->d_name) == 0)
>
> with "if (0)" then the jobs do not get killed when I restart pbs_mom.

I have added this to the AC internal ticketing system so we can get it fixed.
Hi,

I can confirm that we have experienced the same issue since switching cpuset support on. We currently run 2.5.10 and the problem persists.

The good thing is that cpusets should ease process tracking in such a case, since all child processes spawned by a given job are listed in /dev/cpuset/torque/<jobid>/tasks:

cat /dev/cpuset/torque/19164583.batch.grid.cyf-kr.edu.pl/tasks
10807 10866 10875 10889 10956 10962 10965 10993 10994 10995 10996 10997 10998 10999 11000

Cheers
--
LKF
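To illustrate the process-tracking point, here is a rough sketch of how the PIDs of a job could be collected from its tasks file. It assumes the tasks file holds whitespace-separated PIDs and that job cpusets live under a root such as /dev/cpuset/torque, as in the listing above. The function names parse_tasks and job_pids are made up for this example and are not TORQUE functions.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Read whitespace-separated PIDs from an open tasks file.
 * Stores at most max_pids entries and returns how many were stored. */
size_t parse_tasks(FILE *fp, pid_t *pids, size_t max_pids)
{
    size_t n = 0;
    long pid;

    assert(fp != NULL && pids != NULL);
    while (n < max_pids && fscanf(fp, "%ld", &pid) == 1)
        pids[n++] = (pid_t)pid;
    return n;
}

/* Hypothetical wrapper: open <cpuset_root>/<jobid>/tasks and collect the
 * PIDs. Returns the number of PIDs read, or -1 if the path is too long
 * or the file cannot be opened (e.g. the job has no cpuset). */
long job_pids(const char *cpuset_root, const char *jobid,
              pid_t *pids, size_t max_pids)
{
    char path[4096];
    FILE *fp;
    size_t n;
    int len;

    len = snprintf(path, sizeof path, "%s/%s/tasks", cpuset_root, jobid);
    if (len < 0 || (size_t)len >= sizeof path)
        return -1;

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    n = parse_tasks(fp, pids, max_pids);
    fclose(fp);
    return (long)n;
}
```

A restarted mom could, in principle, call job_pids("/dev/cpuset/torque", jobid, ...) for each job it recovers and re-attach to the returned processes rather than killing them.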
Looks like this has been fixed in SVN with commits 5855 and 5856. Unfortunately, the commits don't reference this BZ number.
Fixed in 2.5.11 revision 5855