[torquedev] [Bug 195] New: cpuset VFS path change for 3.x kernels

bugzilla-daemon at supercluster.org bugzilla-daemon at supercluster.org
Fri May 11 16:58:27 MDT 2012


http://www.clusterresources.com/bugzilla/show_bug.cgi?id=195

           Summary: cpuset VFS path change for 3.x kernels
           Product: TORQUE
           Version: 3.0.x
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P5
         Component: pbs_mom
        AssignedTo: knielson at adaptivecomputing.com
        ReportedBy: bircoph at gmail.com
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0


Created an attachment (id=107)
 --> (http://www.clusterresources.com/bugzilla/attachment.cgi?id=107)
support for new cpuset filenames

Hello,

This is a continuation of the following problem:
http://www.clusterresources.com/pipermail/torqueusers/2012-March/014336.html

I have the very same problem on Gentoo with a vanilla 3.2.14 kernel and
torque-3.0.5, but the solution proposed in that thread doesn't help.

Every job fails to run because pbs_mom is unable to create a cpuset for
it. From pbs_mom.log:

05/01/2012 04:09:11;0001;   pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE
05/01/2012 04:09:11;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Retry job exec failure, retry will be attempted (see syslog for more information)
05/01/2012 04:09:11;0001;   pbs_mom;Job;5.master;ALERT:  job failed phase 3 start
05/01/2012 04:09:11;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters for job 5.master
05/01/2012 04:09:11;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
05/01/2012 04:09:11;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
05/01/2012 04:09:11;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
05/01/2012 04:09:11;0080;   pbs_mom;Job;5.master;obit sent to server
05/01/2012 04:09:12;0080;   pbs_mom;Job;5.master;removed job script

And in syslog:
May 01 04:09:11 [pbs_mom] LOG_ERROR::TMomFinalizeChild, Could not create cpuset for job 5.master

Both /sys/fs/cgroup/cpuset and /dev/cpuset are mounted with the cpuset
filesystem type:

$ mount | egrep "cpuset|cgroup"
cgroup_root on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,size=10240k,mode=755)
openrc on /sys/fs/cgroup/openrc type cgroup (rw,nosuid,nodev,noexec,relatime,release_agent=/lib64/rc/sh/cgroup-release-agent.sh,name=openrc)
none on /dev/cpuset type cpuset (rw)
- on /sys/fs/cgroup/cpuset type cpuset (rw)

Their contents are identical, and the file names in both carry the
"cpuset." prefix.

It looks like this change was made in the 3.0 kernel; at least it works on
2.6.38 and fails on 3.2.14. The kernel's Documentation/cgroups/cpuset.txt
says, since 3.0.y, that the "cpuset." prefix must be used.

I wrote a patch that accounts for the path change depending on the Linux
kernel version. I verified that with this patch jobs run and CPU
restrictions are enforced by the scheduler.
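
To illustrate the idea, here is a minimal sketch of version-dependent file
naming. It is not the attached patch: the helper names
kernel_uses_cpuset_prefix and cpuset_file_path are made up for this example,
and the "3.x and later means prefixed" rule is an assumption based only on
the 2.6.38 and 3.2.14 observations above.

```c
#include <stdio.h>

/* Hypothetical helper, not taken from the attached patch.
 * Takes a kernel release string such as the one printed by `uname -r`.
 * Returns 1 for a 3.x or later kernel (assumed to use the "cpuset."
 * prefix), 0 for an older kernel, and -1 if the string cannot be parsed. */
static int kernel_uses_cpuset_prefix(const char *release)
{
    int major = 0;

    if (release == NULL || sscanf(release, "%d", &major) != 1)
        return -1;
    return major >= 3;
}

/* Builds the path of a cpuset control file such as "cpus" or "mems"
 * inside dir. Returns 0 on success, -1 if the release string cannot be
 * parsed or buf is too small for the result. */
static int cpuset_file_path(const char *release, const char *dir,
                            const char *name, char *buf, size_t len)
{
    int prefixed = kernel_uses_cpuset_prefix(release);
    int n;

    if (prefixed < 0)
        return -1;
    n = snprintf(buf, len, "%s/%s%s", dir, prefixed ? "cpuset." : "", name);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With release "3.2.14", directory "/dev/cpuset" and name "cpus" this yields
/dev/cpuset/cpuset.cpus; with "2.6.38" it yields /dev/cpuset/cpus. Checking
at run time which of the two files actually exists might be more robust than
comparing version numbers, but that is a different approach from the one
described here.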
