Bug 174 - pbs_mom kills running jobs despite -p flag
Status: RESOLVED FIXED
Product: TORQUE
Component: pbs_mom
Version: 2.5.x
Platform: PC Linux
Importance: P5 critical
Assigned To: Ken Nielson
 
Reported: 2012-03-05 21:18 MST by Martin Siegert
Modified: 2012-03-08 09:49 MST (History)

Description Martin Siegert 2012-03-05 21:18:43 MST
I want to restart pbs_mom on a node where it has died, without killing the
jobs that are still running on that node. We used to be able to do this by
starting pbs_mom with the -p flag, but this no longer works: every time I
start the mom using "pbs_mom -p", all running jobs get killed. My feeling is
that -p stopped working when we started to use cpusets (I am not absolutely
sure about this, since we have also upgraded torque versions since then). We
are currently running torque-2.5.10.
Comment 1 Martin Siegert 2012-03-06 14:41:47 MST
When I replace the line 214 in cpuset.c

        if (cpuset_delete(pdirent->d_name) == 0)

with "if (0)" then the jobs do not get killed when I restart pbs_mom.
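
The observation above suggests that the startup cleanup deletes job cpusets even when -p asks the mom to preserve running jobs. The following is only a minimal sketch of the kind of guard that would avoid this, not the change that was made in TORQUE. Both function names and the preserve_jobs parameter are hypothetical, and the sketch assumes that a cpuset's tasks file lists one PID per line.

```c
#include <stdio.h>

/* Hypothetical helper (not a TORQUE function): returns 1 if the cpuset's
 * tasks file lists at least one PID, 0 if it is empty or cannot be read. */
int cpuset_has_tasks(const char *tasks_path)
{
    FILE *fp = fopen(tasks_path, "r");
    long pid;
    int found = 0;

    if (fp == NULL)
        return 0;

    /* One successfully parsed number is enough to know a task remains. */
    if (fscanf(fp, "%ld", &pid) == 1)
        found = 1;

    fclose(fp);
    return found;
}

/* Hypothetical guard: when running jobs are to be preserved, a cpuset
 * that still has tasks is left alone; otherwise it may be deleted. */
int should_delete_cpuset(int preserve_jobs, const char *tasks_path)
{
    if (preserve_jobs && cpuset_has_tasks(tasks_path))
        return 0;
    return 1;
}
```

With a check like this in front of the cpuset_delete() call, a restart that is meant to keep jobs alive would skip any cpuset that still contains processes.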
Comment 2 Ken Nielson 2012-03-06 16:28:39 MST
(In reply to comment #1)
> When I replace the line 214 in cpuset.c
> 
>         if (cpuset_delete(pdirent->d_name) == 0)
> 
> with "if (0)" then the jobs do not get killed when I restart pbs_mom.

I have added this to the AC internal ticketing system so we can get it fixed.
Comment 3 Lukasz Flis 2012-03-07 06:35:57 MST
Hi, 

I can confirm we have experienced the same issue since switching cpuset
support on.

We currently run 2.5.10 and the problem persists.

The good thing is that cpusets should ease process tracking in such a case,
since all child processes spawned by a given job are listed in:

/dev/cpuset/torque/<jobid>/tasks

cat /dev/cpuset/torque/19164583.batch.grid.cyf-kr.edu.pl/tasks
10807
10866
10875
10889
10956
10962
10965
10993
10994
10995
10996
10997
10998
10999
11000

Cheers
--
LKF
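
As a rough illustration of the process tracking described in this comment, the sketch below reads the PIDs from a cpuset tasks file into an array. The function name read_cpuset_tasks is illustrative and not part of TORQUE, and the sketch assumes the file holds one PID per line, as in the listing above.

```c
#include <stdio.h>
#include <sys/types.h>

/* Reads up to max PIDs from a cpuset tasks file (one PID per line).
 * Returns the number of PIDs stored, or -1 if the file cannot be opened.
 * Illustrative helper only; not a TORQUE function. */
int read_cpuset_tasks(const char *tasks_path, pid_t *pids, int max)
{
    FILE *fp = fopen(tasks_path, "r");
    long value;
    int count = 0;

    if (fp == NULL)
        return -1;

    /* Stop at end of file, at the first non-numeric token, or when
     * the caller's array is full. */
    while (count < max && fscanf(fp, "%ld", &value) == 1)
        pids[count++] = (pid_t)value;

    fclose(fp);
    return count;
}
```

A caller would pass a path of the form /dev/cpuset/torque/<jobid>/tasks and could then match the returned PIDs against the job's known processes.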
Comment 4 Chris Samuel 2012-03-07 20:11:37 MST
Looks like this has been fixed in SVN with commits 5855 and 5856.

Unfortunately, the commits don't reference this BZ number.
Comment 5 Ken Nielson 2012-03-08 09:49:36 MST
Fixed in 2.5.11 revision 5855