Bugzilla – Bug 61
pbs_mom changing /dev/null mode and perms
Last modified: 2010-09-07 14:15:43 MDT
pbs_mom changes /dev/null mode and perms when running prologue in some unknown circumstances. It goes from:

0 crw-rw-rw- 1 root root 1, 3 May 10 12:31 /dev/null

to:

0 -rw-r--r-- 1 root root 0 May 11 10:42 /dev/null

audit logs:

type=PATH msg=audit(05/11/2010 03:32:18.394:111668) : item=1 name=/dev/null inode=2085 dev=00:11 mode=character,666 ouid=root ogid=root rdev=01:03
type=PATH msg=audit(05/11/2010 03:32:18.394:111668) : item=0 name=/dev/ inode=1120 dev=00:11 mode=dir,755 ouid=root ogid=root rdev=00:00
type=CWD msg=audit(05/11/2010 03:32:18.394:111668) : cwd=/var/spoo /pbs/mom_priv
type=SYSCALL msg=audit(05/11/2010 03:32:18.394:111668) : arch=x86_64 syscall=unlink success=yes exit=0 a0=6a17a0 a1=15c724b6 a2=15c724ac a3=726576726573206e items=2 ppid=1 pid=18208 auid=root uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=7327 comm=pbs_mom exe=/usr/sbin/pbs_mom key=NULL_touch
----

Notice the unlink syscall. It happened when:

# grep 03:32 /var/spool/pbs/mom_logs/20100511
05/11/2010 03:32:17;0080; pbs_mom;Job;10327910.pbs02.pic.es;scan_for_terminated: job 10327910.pbs02.pic.es task 1 terminated, sid=3571
05/11/2010 03:32:17;0008; pbs_mom;Job;10327910.pbs02.pic.es;job was terminated
05/11/2010 03:32:17;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
05/11/2010 03:32:17;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
05/11/2010 03:32:17;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
05/11/2010 03:32:17;0008; pbs_mom;Job;10327910.pbs02.pic.es;checking job post-processing routine
05/11/2010 03:32:17;0080; pbs_mom;Job;10327910.pbs02.pic.es;obit sent to server
05/11/2010 03:32:18;0080; pbs_mom;Job;10268969.pbs02.pic.es;scan_for_terminated: job 10268969.pbs02.pic.es task 1 terminated, sid=19965
05/11/2010 03:32:18;0008; pbs_mom;Job;10268969.pbs02.pic.es;job was terminated
05/11/2010 03:32:18;0080; pbs_mom;Job;10327910.pbs02.pic.es;removing transient job directory /home/tmp/10327910.pbs02.pic.es
05/11/2010 03:32:18;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
05/11/2010 03:32:18;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
05/11/2010 03:32:18;0001; pbs_mom;Job;10268969.pbs02.pic.es;preobit_reply, unknown on server, deleting locally
05/11/2010 03:32:18;0080; pbs_mom;Job;10268969.pbs02.pic.es;removing transient job directory /home/tmp/10268969.pbs02.pic.es

torque version:

# rpm -qa|grep torque
torque-2.3.6-2cri.el5.x86_64
torque-mom-2.3.6-2cri.el5.x86_64
torque-client-2.3.6-2cri.el5.x86_64
Hi all,

Sorry, it's not prologue but epilogue. This is our code:

# cat epilogue
#!/bin/bash
#argv[1] job id
#argv[2] job execution user name
#argv[3] job execution group name
#argv[4] job name
#argv[5] session id
#argv[6] list of requested resource limits
#argv[7] list of resources used by job
#argv[8] job execution queue
#argv[9] job account
#argv[10] job exit code
if [ -n $1 ]
then
rm -rf /home/tmp/$1
fi
exit 0

Cheers,
Arnau
The thing is that it's not the epilogue that's removing the file, it's the pbs_mom program itself according to your audit log - very odd!
As per the discussion on the torquedev mailing list (http://www.supercluster.org/pipermail/torquedev/2010-August/002734.html), I have created a patch that changes the plain unlink() calls to job_unlink_file() calls, which act with the job user's credentials. This should solve the disappearance of the /dev/null device and will generally prevent the removal of files that do not belong to the job user.
Created an attachment (id=51)
Patch for Torque 2.5.2 that invokes unlink() with job owner credentials

This patch is now being tested on our production cluster with Torque 2.5.2-snap.201008091147, but it should also fit other recent Torque releases.
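To illustrate the idea behind the patch, here is a minimal sketch of unlinking a file under another user's effective credentials. It is not the job_unlink_file() from the attachment, whose implementation is not shown in this thread and may well differ; unlink_as_user() and its parameters are hypothetical names. The sketch assumes a POSIX system and a process that is allowed to switch to the given IDs (root, or already running as them). Supplementary groups are left untouched for brevity.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, NOT the job_unlink_file() from the patch.
 * Temporarily switches the effective GID/UID to the given values,
 * calls unlink(), then restores the previous effective IDs.
 * Returns 0 on success, -1 on failure with errno set.
 */
int unlink_as_user(const char *path, uid_t uid, gid_t gid)
{
    uid_t saved_uid = geteuid();
    gid_t saved_gid = getegid();
    int rc, saved_errno;

    assert(path != NULL);

    /* Change the group first: after giving up an effective UID of root
     * the process may no longer be allowed to change its GID. */
    if (setegid(gid) != 0)
        return -1;

    if (seteuid(uid) != 0) {
        saved_errno = errno;
        if (setegid(saved_gid) != 0)
            abort();
        errno = saved_errno;
        return -1;
    }

    /* The kernel now checks permissions against the target user. */
    rc = unlink(path);
    saved_errno = errno;

    /* Restore in reverse order; a daemon that cannot get its own
     * credentials back should not keep running. */
    if (seteuid(saved_uid) != 0 || setegid(saved_gid) != 0)
        abort();

    errno = saved_errno;
    return rc;
}
```

With this approach, an attempt to unlink /dev/null on behalf of an unprivileged job owner fails with a permission error, because that user has no write permission on /dev (mode dir,755, owned by root in the audit record above), while files in a directory the job owner can write to are still removed normally.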
Patch has been merged and committed to 2.5-fixes.
As 2.4 has replaced 2.3 as the "ultra-stable fixes only" branch (RIP 2.3), can this be backported to 2.4, please?
Patch added to 2.4.11.