Bug 61 - pbs_mom changing /dev/null mode and perms
Status: RESOLVED FIXED
Product: TORQUE
Component: pbs_mom
Version: 2.3.x
Platform: Other Linux
Importance: P5 critical
Assigned To: Glen

Reported: 2010-05-20 03:25 MDT by Arnau
Modified: 2014-02-02 19:30 MST

Attachments
Patch for Torque 2.5.2 that invokes unlink() with job owner credentials (4.35 KB, patch)
2010-08-11 00:52 MDT, Eygene Ryabinkin

Description Arnau 2010-05-20 03:25:12 MDT
pbs_mom changes the file type and permissions of /dev/null while running the
prologue, under circumstances we have not yet identified:

it goes from:
0 crw-rw-rw- 1 root root 1, 3 May 10 12:31 /dev/null
to
0 -rw-r--r-- 1 root root 0 May 11 10:42 /dev/null

audit logs:

type=PATH msg=audit(05/11/2010 03:32:18.394:111668) : item=1 name=/dev/null
inode=2085 dev=00:11 mode=character,666 ouid=root ogid=root rdev=01:03

type=PATH msg=audit(05/11/2010 03:32:18.394:111668) : item=0 name=/dev/
inode=1120 dev=00:11 mode=dir,755 ouid=root ogid=root rdev=00:00

type=CWD msg=audit(05/11/2010 03:32:18.394:111668) :  cwd=/var/spoo
/pbs/mom_priv

type=SYSCALL msg=audit(05/11/2010 03:32:18.394:111668) : arch=x86_64
syscall=unlink success=yes exit=0 a0=6a17a0 a1=15c724b6 a2=15c724ac
a3=726576726573206e items=2 ppid=1 pid=18208 auid=root uid=root gid=root
euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none)
ses=7327 comm=pbs_mom exe=/usr/sbin/pbs_mom key=NULL_touch
----

Notice the unlink syscall.

It happened when:

# grep 03:32 /var/spool/pbs/mom_logs/20100511 
05/11/2010 03:32:17;0080;   pbs_mom;Job;10327910.pbs02.pic.es;scan_for_terminated: job 10327910.pbs02.pic.es task 1 terminated, sid=3571
05/11/2010 03:32:17;0008;   pbs_mom;Job;10327910.pbs02.pic.es;job was terminated
05/11/2010 03:32:17;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
05/11/2010 03:32:17;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
05/11/2010 03:32:17;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
05/11/2010 03:32:17;0008;   pbs_mom;Job;10327910.pbs02.pic.es;checking job post-processing routine
05/11/2010 03:32:17;0080;   pbs_mom;Job;10327910.pbs02.pic.es;obit sent to server
05/11/2010 03:32:18;0080;   pbs_mom;Job;10268969.pbs02.pic.es;scan_for_terminated: job 10268969.pbs02.pic.es task 1 terminated, sid=19965
05/11/2010 03:32:18;0008;   pbs_mom;Job;10268969.pbs02.pic.es;job was terminated
05/11/2010 03:32:18;0080;   pbs_mom;Job;10327910.pbs02.pic.es;removing transient job directory /home/tmp/10327910.pbs02.pic.es
05/11/2010 03:32:18;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
05/11/2010 03:32:18;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
05/11/2010 03:32:18;0001;   pbs_mom;Job;10268969.pbs02.pic.es;preobit_reply, unknown on server, deleting locally
05/11/2010 03:32:18;0080;   pbs_mom;Job;10268969.pbs02.pic.es;removing transient job directory /home/tmp/10268969.pbs02.pic.es

torque version:

# rpm -qa|grep torque
torque-2.3.6-2cri.el5.x86_64
torque-mom-2.3.6-2cri.el5.x86_64
torque-client-2.3.6-2cri.el5.x86_64
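
The audit records in the description carry key=NULL_touch, which suggests a
file watch on /dev/null, although the actual rule is not shown in the report.
The following is a minimal sketch of how such a watch might be set up and how
a node could check whether /dev/null is still a character device. The function
name check_char_device is hypothetical, and the auditctl rule in the comments
is an assumption rather than the reporter's configuration.

```shell
#!/bin/bash
# Report whether a path is still a character device (as /dev/null should be).
# Prints "ok" and returns 0, or prints "damaged" and returns 1.
check_char_device() {
    local path="${1:-/dev/null}"
    if [ -c "$path" ]; then
        echo "ok"
    else
        echo "damaged"
        return 1
    fi
}

# Assumed audit watch (needs root); the key matches the one in the records:
#   auditctl -w /dev/null -p wa -k NULL_touch
#   ausearch -k NULL_touch -i
#
# Possible repair once the node has been replaced by a regular file (needs
# root); major 1 and minor 3 come from the "1, 3" in the original listing:
#   rm -f /dev/null && mknod -m 666 /dev/null c 1 3
```

Running check_char_device with no argument from cron or a node health check
would flag the condition shortly after it occurs, which narrows down which
job triggered it.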
Comment 1 Arnau 2010-05-26 07:05:26 MDT
Hi all,

Sorry, it's not prologue but epilogue. 

This is our code:

# cat epilogue 
#!/bin/bash

#argv[1]      job id
#argv[2]     job execution user name
#argv[3]     job execution group name
#argv[4]     job name
#argv[5]     session id
#argv[6]     list of requested resource limits
#argv[7]     list of resources used by job
#argv[8]     job execution queue
#argv[9]     job account
#argv[10]     job exit code

if [ -n $1 ]
then
    rm -rf /home/tmp/$1
fi
exit 0


Cheers,
Arnau
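
A side note on the epilogue above: the test [ -n $1 ] is unquoted, so when $1
is empty it collapses to [ -n ], which is true, and the rm would then target
/home/tmp/ itself. This is separate from the /dev/null problem in this bug.
A minimal sketch of a quoted variant follows; the helper name remove_job_tmp
is hypothetical and not part of TORQUE.

```shell
#!/bin/bash
# Hypothetical helper: remove the per-job scratch directory only when a
# job id was actually passed.
remove_job_tmp() {
    local base="$1" jobid="$2"
    # Quoting matters: with an unquoted, empty job id the test would become
    # `[ -n ]`, which succeeds, and rm would be pointed at the base directory.
    if [ -n "$jobid" ]; then
        # ${base:?} aborts instead of expanding to "/<jobid>" if base is empty.
        rm -rf -- "${base:?}/${jobid}"
    fi
}

# In an epilogue this might be called as:
#   remove_job_tmp /home/tmp "$1"
```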
Comment 2 Chris Samuel 2010-06-06 23:36:50 MDT
The thing is that it's not the epilogue that's removing the file, it's the
pbs_mom program itself according to your audit log - very odd!
Comment 3 Eygene Ryabinkin 2010-08-11 00:50:24 MDT
As per the discussion on the torquedev mailing list
(http://www.supercluster.org/pipermail/torquedev/2010-August/002734.html), I
have created a patch that replaces plain unlink() calls with job_unlink_file()
calls, which act with the job user's credentials.

This should stop /dev/null from disappearing and, more generally, prevent
pbs_mom from removing files that the job user's credentials do not permit it
to remove.
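
To illustrate the intent of the patch at the shell level, here is a rough
sketch of a weaker, different technique: checking ownership before removing a
file. It is not what the patch does (the patch switches to the job user's
credentials inside pbs_mom before calling unlink()), and the helper name
unlink_if_owned_by is hypothetical. It assumes GNU stat.

```shell
#!/bin/bash
# Hypothetical helper: remove a file only if it is owned by the given
# numeric uid. Returns 1 and leaves the file in place otherwise.
unlink_if_owned_by() {
    local uid="$1" path="$2" owner
    # GNU stat: %u prints the numeric owner of the path itself.
    owner=$(stat -c %u -- "$path" 2>/dev/null) || return 1
    if [ "$owner" = "$uid" ]; then
        rm -f -- "$path"
    else
        return 1
    fi
}
```

With such a check, a root-owned /dev/null would be left alone for any non-root
job user. The check and the removal are separate steps, so the file can change
in between; performing the unlink under the job user's credentials, as the
patch does, avoids that gap.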
Comment 4 Eygene Ryabinkin 2010-08-11 00:52:47 MDT
Created an attachment (id=51) [details]
Patch for Torque 2.5.2 that invokes unlink() with job owner credentials

This patch is now being tested on our production cluster with Torque
2.5.2-snap.201008091147, and it should also apply to other recent Torque
releases.
Comment 5 Ken Nielson 2010-08-20 16:43:04 MDT
Patch has been merged and committed to 2.5-fixes.
Comment 6 Chris Samuel 2010-08-22 18:15:20 MDT
As 2.4 has replaced 2.3 as the "ultra-stable fixes only" branch (RIP 2.3), can
this be backported to 2.4, please?
Comment 7 Ken Nielson 2010-09-07 14:15:43 MDT
Patch added to 2.4.11.