Bug 195 - cpuset VFS path change for 3.x kernels
Status: NEW
Product: TORQUE
Component: pbs_mom
Version: 3.0.x
Platform: PC Linux
Importance: P5 normal
Assigned To: Ken Nielson
Reported: 2012-05-11 16:58 MDT by Andrew Savchenko
Modified: 2013-07-23 14:38 MDT (History)
Attachments
support for new cpuset filenames (4.98 KB, patch)
2012-05-11 16:58 MDT, Andrew Savchenko


Description Andrew Savchenko 2012-05-11 16:58:27 MDT
Created an attachment (id=107)
support for new cpuset filenames

Hello,

this is a continuation of the following problem:
http://www.clusterresources.com/pipermail/torqueusers/2012-March/014336.html

I have the very same problem on Gentoo with a vanilla 3.2.14 kernel and
torque-3.0.5, but the solution suggested there doesn't help.

Every job fails to run because pbs_mom is unable to create a cpuset for
it. From pbs_mom.log:

05/01/2012 04:09:11;0001;   pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE
05/01/2012 04:09:11;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Retry job exec failure, retry will be attempted (see syslog for more information)
05/01/2012 04:09:11;0001;   pbs_mom;Job;5.master;ALERT:  job failed phase 3 start
05/01/2012 04:09:11;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters for job 5.master
05/01/2012 04:09:11;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
05/01/2012 04:09:11;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
05/01/2012 04:09:11;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
05/01/2012 04:09:11;0080;   pbs_mom;Job;5.master;obit sent to server
05/01/2012 04:09:12;0080;   pbs_mom;Job;5.master;removed job script

And in syslog:
May 01 04:09:11 [pbs_mom] LOG_ERROR::TMomFinalizeChild, Could not create cpuset for job 5.master

/sys/fs/cgroup/cpuset and /dev/cpuset are both mounted as cpuset
filesystem type:

$ mount | egrep "cpuset|cgroup"
cgroup_root on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,size=10240k,mode=755)
openrc on /sys/fs/cgroup/openrc type cgroup (rw,nosuid,nodev,noexec,relatime,release_agent=/lib64/rc/sh/cgroup-release-agent.sh,name=openrc)
none on /dev/cpuset type cpuset (rw)
- on /sys/fs/cgroup/cpuset type cpuset (rw)

Their contents are identical, and in both the file names carry the "cpuset."
prefix.

It looks like this change was made in the 3.0 kernel; at least it works on
2.6.38 and fails on 3.2.14. The kernel's Documentation/cgroups/cpuset.txt
since 3.0.y says that the "cpuset." prefix must be used.

I wrote a patch that accounts for the path changes depending on the Linux
kernel version. I verified that with this patch tasks run and CPU
restrictions are enforced by the scheduler.
Comment 1 Chris Samuel 2012-05-13 22:29:23 MDT
Can you do an ls of your /dev/cpuset mount please ?

I've just had a look with the 3.2 kernel on my Ubuntu laptop and when I do:

mkdir /dev/cpuset
mount -t cpuset - /dev/cpuset

I see:

root@eris:~# ls -1 /dev/cpuset
cgroup.clone_children
cgroup.event_control
cgroup.procs
cpu_exclusive
cpus
mem_exclusive
mem_hardwall
memory_migrate
memory_pressure
memory_pressure_enabled
memory_spread_page
memory_spread_slab
mems
notify_on_release
release_agent
sched_load_balance
sched_relax_domain_level
tasks

and the usual routine of:

mkdir foo
cd foo
echo 0-1 > cpus
echo 0 > mems
echo $$ > tasks

all works, which is basically all that Torque does.
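
The routine above can be wrapped in a small shell function so that the same
steps work with either file naming scheme. This is only a sketch:
make_cpuset and its argument order are hypothetical names made up for
illustration, not anything Torque provides, and the optional last argument
is assumed to be either empty or "cpuset.".

```shell
# Sketch only: make_cpuset is a hypothetical helper, not part of Torque.
# Usage: make_cpuset ROOT NAME CPUS MEMS PID [PREFIX]
# PREFIX is "" for the legacy file names (cpus, mems) or "cpuset." for
# mounts that expose prefixed names (cpuset.cpus, cpuset.mems).
make_cpuset() {
    root=$1 name=$2 cpus=$3 mems=$4 pid=$5 prefix=${6:-}
    dir="$root/$name"

    mkdir -p "$dir" || return 1
    # cpus and mems have to be set before a task can be attached.
    echo "$cpus" > "$dir/${prefix}cpus" || return 1
    echo "$mems" > "$dir/${prefix}mems" || return 1
    # The tasks file is named "tasks" under both naming schemes.
    echo "$pid" > "$dir/tasks" || return 1
}
```

For example, "make_cpuset /dev/cpuset foo 0-1 0 $$" repeats the steps shown
above, and adding "cpuset." as a sixth argument targets a prefixed mount.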
Comment 2 David Singleton 2012-05-13 22:48:17 MDT
Chris,

If you run mount, you'll see your cpuset vfs is mounted with the noprefix
option. The "modern way" is to mount -t cgroup -o cpuset  in which case
you'll end up with the "cpuset." prefix on cpuset attributes. 

David
Comment 3 Chris Samuel 2012-05-13 23:05:14 MDT
Hi David,

But if you want Torque to work unmodified you shouldn't do that. :-)

Breaking userspace is a bad thing so the noprefix behaviour is unlikely to go
away - here's a rant from Linus back in March on his attitude to breaking user
apps..

https://lkml.org/lkml/2012/3/8/495
Comment 4 Andrew Savchenko 2012-05-14 05:57:31 MDT
I do not use noprefix option, thus ls shows "cpuset." prefixes.

There is no such thing as a stable kernel API and there are good reasons for
this. New applications will eventually use modern way of handling things, so
torque should adapt as well otherwise conflicts will occur sooner or later.

Anyway if you plan to stick to old file names at least for a while, please put
somewhere in the documentation, that people should use -o noprefix.
Comment 5 Chris Samuel 2012-05-24 21:48:57 MDT
(In reply to comment #4)

> I do not use noprefix option, thus ls shows "cpuset." prefixes.

Neither do I, and ls does not show "cpuset." prefixes.  The reason is that
you already have a cgroup filesystem mounted and I do not.

This change in behaviour is since the Linux kernel commit
f9ab5b5b0f5be506640321d710b0acd3dca6154a "cgroups: forbid noprefix if mounting
more than just cpuset subsystem".

I'll try and find some time to report this as a kernel regression to see what
their attitude to this is - to me it seems like the sort of ABI behaviour
change and consequent user space breakage that Linus hates.

> There is no such thing as a stable kernel API and there are good reasons for
> this.

You are mistaking the *internal* kernel APIs (which are indeed unstable for
very good reason) with the external kernel ABIs exposed to user space and which
have different rules applied.

There has been an attempt to document the level of stability of interfaces in
Documentation/ABI directory (see the README for Greg-KH's reasoning), but as
far as I can tell the cpuset/cgroup stuff has not been added yet.

> New applications will eventually use modern way of handling things, so
> torque should adapt as well otherwise conflicts will occur sooner or later.

Agreed, but Torque will need to cope with both cases dynamically.
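
One way to cope with both cases at run time is to probe the mounted
directory for the file names it actually exposes, rather than inferring them
from the kernel version. A minimal sketch, assuming the only two variants
are "cpus" and "cpuset.cpus"; cpuset_prefix is a hypothetical helper name,
not an existing Torque function:

```shell
# Sketch only: cpuset_prefix is a hypothetical helper, not Torque code.
# Prints "cpuset." or an empty line depending on which file names DIR
# exposes; returns 1 if DIR does not look like a cpuset hierarchy at all.
cpuset_prefix() {
    dir=$1
    if [ -e "$dir/cpuset.cpus" ]; then
        printf '%s\n' "cpuset."
    elif [ -e "$dir/cpus" ]; then
        printf '\n'
    else
        return 1
    fi
}
```

A caller could then do "prefix=$(cpuset_prefix /dev/cpuset)" once at startup
and build every path as "${prefix}cpus" and "${prefix}mems".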

> Anyway if you plan to stick to old file names at least for a while, please put
> somewhere in the documentation, that people should use -o noprefix.

Sounds like a good idea, I've just tested that on a RHEL5 system and it didn't
complain about not knowing what that meant.
Comment 6 Matt Ezell 2013-07-19 14:22:20 MDT
I'm testing this on a RHEL6 system, and I can't seem to get the cpuset file
system to mount without the prefixes:

# mount |grep cpuset
# mount |grep cgroup
# mount -t cgroup -o cpuset,noprefix none /dev/cpuset
# ls /dev/cpuset|grep cpus
cpuset.cpu_exclusive
cpuset.cpus
cpuset.mem_exclusive
cpuset.mem_hardwall
cpuset.memory_migrate
cpuset.memory_pressure
cpuset.memory_pressure_enabled
cpuset.memory_spread_page
cpuset.memory_spread_slab
cpuset.mems
cpuset.sched_load_balance
cpuset.sched_relax_domain_level
# mount|grep cpuset
none on /dev/cpuset type cgroup (rw,cpuset,noprefix)
# uname -r
2.6.32-358.11.1.el6.x86_64

I'm not sure if I'm doing something wrong, or if my kernel just doesn't
understand 'noprefix'.  Either way, I think TORQUE should support both
syntaxes.

The proposed patch looks for a specific kernel version, but clearly RedHat has
backported cgroups making that check incorrect.
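
As an alternative to a kernel version check, the mount table can be
inspected directly. The sketch below assumes a table in /proc/mounts format
and treats a mount as usable if its type is "cpuset" or if it is a "cgroup"
mount whose options include "cpuset". find_cpuset_mount is a hypothetical
helper name, and mount points containing escaped characters are not handled.

```shell
# Sketch only: find_cpuset_mount is a hypothetical helper.
# Reads a mount table in /proc/mounts format (default: /proc/mounts) and
# prints the first mount point that is either a "cpuset" filesystem or a
# "cgroup" filesystem mounted with the cpuset option.
# Returns non-zero if no such mount is found.
find_cpuset_mount() {
    awk '
        $3 == "cpuset" { print $2; found = 1; exit }
        $3 == "cgroup" {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "cpuset") { print $2; found = 1; exit }
        }
        END { exit !found }
    ' "${1:-/proc/mounts}"
}
```

Since the output above shows "noprefix" in the mount options while the files
are still prefixed, the options alone should not be trusted to decide the
file names; checking which files exist in the directory found this way is
the safer test.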
Comment 7 David Beer 2013-07-23 14:37:32 MDT
(In reply to comment #6)
+1 for being annoyed that they'd break user applications. I don't know why
things like this are done.

> I'm not sure if I'm doing something wrong, or if my kernel just doesn't
> understand 'noprefix'.  Either way, I think TORQUE should support both
> syntaxes.
> 
> The proposed patch looks for a specific kernel version, but clearly RedHat has
> backported cgroups making that check incorrect.

We may well need to make this patch slightly more sophisticated to work in all
cases, but it is a good patch. I wonder if hwloc already handles this or not?
Does anyone know if this is broken for the 4 series? I assume it is but since
we use hwloc they might solve it for us - anyone can wish, right? 

From Adaptive's perspective we will want to fix this just to avoid the support
calls we'd have to field for not fixing it.
Comment 8 David Beer 2013-07-23 14:38:09 MDT
I also meant to say - we hope to support cgroups at some point so that's
another reason to allow this.