Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *   Generic  process - grouping  system . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *   Based  originally  on  the  cpuset  system ,  extracted  by  Paul  Menage 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *   Copyright  ( C )  2006  Google ,  Inc 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
										
										
											2010-03-10 15:22:20 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *   Notifications  support 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *   Copyright  ( C )  2009  Nokia  Corporation 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *   Author :  Kirill  A .  Shutemov 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								 *   Copyright  notices  from  the  original  cpuset  code : 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *   - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *   Copyright  ( C )  2003  BULL  SA . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *   Copyright  ( C )  2004 - 2006  Silicon  Graphics ,  Inc . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *   Portions  derived  from  Patrick  Mochel ' s  sysfs  code . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *   sysfs  is  Copyright  ( c )  2001 - 3  Patrick  Mochel 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *   2003 - 10 - 10  Written  by  Simon  Derr . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *   2003 - 10 - 22  Updates  by  Stephen  Hemminger . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *   2004  May - July  Rework  by  Paul  Jackson . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *   - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *   This  file  is  subject  to  the  terms  and  conditions  of  the  GNU  General  Public 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *   License .   See  the  file  COPYING  in  the  main  directory  of  the  Linux 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *   distribution  for  more  details . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-04-25 18:28:03 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# define pr_fmt(fmt) KBUILD_MODNAME ": " fmt 
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								# include  <linux/cgroup.h> 
  
						 
					
						
							
								
									
										
										
										
											2011-06-02 21:20:51 +10:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# include  <linux/cred.h> 
  
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:19 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# include  <linux/ctype.h> 
  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								# include  <linux/errno.h> 
  
						 
					
						
							
								
									
										
										
										
											2011-06-02 21:20:51 +10:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# include  <linux/init_task.h> 
  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								# include  <linux/kernel.h> 
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# include  <linux/list.h> 
  
						 
					
						
							
								
									
										
										
										
											2014-04-26 15:40:28 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# include  <linux/magic.h> 
  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								# include  <linux/mm.h> 
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# include  <linux/mutex.h> 
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# include  <linux/mount.h> 
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# include  <linux/pagemap.h> 
  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# include  <linux/proc_fs.h> 
  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								# include  <linux/rcupdate.h> 
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# include  <linux/sched.h> 
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# include  <linux/slab.h> 
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# include  <linux/spinlock.h> 
  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# include  <linux/rwsem.h> 
  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								# include  <linux/string.h> 
  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:32 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# include  <linux/sort.h> 
  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# include  <linux/kmod.h> 
  
						 
					
						
							
								
									
										
											 
										
											
												Add cgroupstats
This patch is inspired by the discussion at
http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics
as suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.  The
patch is on top of 2.6.21-mm1 with Paul's cgroups v9 patches (forward
ported)
This patch implements per cgroup statistics infrastructure and re-uses
code from the taskstats interface.  A new set of cgroup operations are
registered with commands and attributes.  It should be very easy to
*extend* per cgroup statistics, by adding members to the cgroupstats
structure.
The current model for cgroupstats is a pull, a push model (to post
statistics on interesting events), should be very easy to add.  Currently
user space requests for statistics by passing the cgroup file
descriptor.  Statistics about the state of all the tasks in the cgroup
is returned to user space.
TODO's/NOTE:
This patch provides an infrastructure for implementing cgroup statistics.
Based on the needs of each controller, we can incrementally add more statistics,
event based support for notification of statistics, accumulation of taskstats
into cgroup statistics in the future.
Sample output
# ./cgroupstats -C /cgroup/a
sleeping 2, blocked 0, running 1, stopped 0, uninterruptible 0
# ./cgroupstats -C /cgroup/
sleeping 154, blocked 0, running 0, stopped 0, uninterruptible 0
If the approach looks good, I'll enhance and post the user space utility for
the same
Feedback, comments, test results are always welcome!
[akpm@linux-foundation.org: build fix]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# include  <linux/delayacct.h> 
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# include  <linux/cgroupstats.h> 
  
						 
					
						
							
								
									
										
										
										
											2013-01-10 11:49:27 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# include  <linux/hashtable.h> 
  
						 
					
						
							
								
									
										
										
										
											2009-07-29 15:04:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# include  <linux/pid_namespace.h> 
  
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:23 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# include  <linux/idr.h> 
  
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:28 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# include  <linux/vmalloc.h> /* TODO: replace with more sophisticated array */ 
  
						 
					
						
							
								
									
										
										
										
											2012-04-21 09:13:46 +02:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# include  <linux/kthread.h> 
  
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# include  <linux/delay.h> 
  
						 
					
						
							
								
									
										
											 
										
											
												Add cgroupstats
This patch is inspired by the discussion at
http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics
as suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.  The
patch is on top of 2.6.21-mm1 with Paul's cgroups v9 patches (forward
ported)
This patch implements per cgroup statistics infrastructure and re-uses
code from the taskstats interface.  A new set of cgroup operations are
registered with commands and attributes.  It should be very easy to
*extend* per cgroup statistics, by adding members to the cgroupstats
structure.
The current model for cgroupstats is a pull, a push model (to post
statistics on interesting events), should be very easy to add.  Currently
user space requests for statistics by passing the cgroup file
descriptor.  Statistics about the state of all the tasks in the cgroup
is returned to user space.
TODO's/NOTE:
This patch provides an infrastructure for implementing cgroup statistics.
Based on the needs of each controller, we can incrementally add more statistics,
event based support for notification of statistics, accumulation of taskstats
into cgroup statistics in the future.
Sample output
# ./cgroupstats -C /cgroup/a
sleeping 2, blocked 0, running 1, stopped 0, uninterruptible 0
# ./cgroupstats -C /cgroup/
sleeping 154, blocked 0, running 0, stopped 0, uninterruptible 0
If the approach looks good, I'll enhance and post the user space utility for
the same
Feedback, comments, test results are always welcome!
[akpm@linux-foundation.org: build fix]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2011-07-26 16:09:06 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# include  <linux/atomic.h> 
  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  pidlists  linger  the  following  amount  before  being  destroyed .   The  goal 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  is  avoiding  frequent  destruction  in  the  middle  of  consecutive  read  calls 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Expiring  in  the  middle  is  a  performance  problem  not  a  correctness  one . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  1  sec  should  be  enough . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# define CGROUP_PIDLIST_DESTROY_DELAY	HZ 
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# define CGROUP_FILE_NAME_MAX		(MAX_CGROUP_TYPE_NAMELEN +	\ 
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
													 MAX_CFTYPE_NAME  +  2 ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2011-12-12 18:12:21 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_mutex  is  the  master  lock .   Any  modification  to  cgroup  or  its 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  hierarchy  must  be  performed  while  holding  it . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  css_set_rwsem  protects  task - > cgroups  pointer ,  the  list  of  css_set 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  objects ,  and  the  chain  of  tasks  off  each  css_set . 
							 
						 
					
						
							
								
									
										
										
										
											2011-12-12 18:12:21 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  These  locks  are  exported  if  CONFIG_PROVE_RCU  so  that  accessors  in 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup . h  can  use  them  for  lockdep  annotations . 
							 
						 
					
						
							
								
									
										
										
										
											2011-12-12 18:12:21 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-07 09:29:51 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# ifdef CONFIG_PROVE_RCU 
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								DEFINE_MUTEX ( cgroup_mutex ) ;  
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								DECLARE_RWSEM ( css_set_rwsem ) ;  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								EXPORT_SYMBOL_GPL ( cgroup_mutex ) ;  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								EXPORT_SYMBOL_GPL ( css_set_rwsem ) ;  
						 
					
						
							
								
									
										
										
										
											2013-04-07 09:29:51 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# else 
  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  DEFINE_MUTEX ( cgroup_mutex ) ;  
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  DECLARE_RWSEM ( css_set_rwsem ) ;  
						 
					
						
							
								
									
										
										
										
											2013-04-07 09:29:51 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# endif 
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  Protects  cgroup_idr  and  css_idr  so  that  IDs  can  be  released  without 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  grabbing  cgroup_mutex . 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  DEFINE_SPINLOCK ( cgroup_idr_lock ) ;  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Protects  cgroup_subsys - > release_agent_path .   Modifying  it  also  requires 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_mutex .   Reading  requires  either  cgroup_mutex  or  this  spinlock . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  DEFINE_SPINLOCK ( release_agent_path_lock ) ;  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# define cgroup_assert_mutex_or_rcu_locked()				\ 
  
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:55 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									rcu_lockdep_assert ( rcu_read_lock_held ( )  | | 			\
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											   lockdep_is_held ( & cgroup_mutex ) , 		\
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											   " cgroup_mutex or RCU read lock required " ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use a dedicated workqueue for cgroup destruction
Since be44562613851 ("cgroup: remove synchronize_rcu() from
cgroup_diput()"), cgroup destruction path makes use of workqueue.  css
freeing is performed from a work item from that point on and a later
commit, ea15f8ccdb430 ("cgroup: split cgroup destruction into two
steps"), moves css offlining to workqueue too.
As cgroup destruction isn't depended upon for memory reclaim, the
destruction work items were put on the system_wq; unfortunately, some
controller may block in the destruction path for considerable duration
while holding cgroup_mutex.  As large part of destruction path is
synchronized through cgroup_mutex, when combined with high rate of
cgroup removals, this has potential to fill up system_wq's max_active
of 256.
Also, it turns out that memcg's css destruction path ends up queueing
and waiting for work items on system_wq through work_on_cpu().  If
such operation happens while system_wq is fully occupied by cgroup
destruction work items, work_on_cpu() can't make forward progress
because system_wq is full and other destruction work items on
system_wq can't make forward progress because the work item waiting
for work_on_cpu() is holding cgroup_mutex, leading to deadlock.
This can be fixed by queueing destruction work items on a separate
workqueue.  This patch creates a dedicated workqueue -
cgroup_destroy_wq - for this purpose.  As these work items shouldn't
have inter-dependencies and mostly serialized by cgroup_mutex anyway,
giving high concurrency level doesn't buy anything and the workqueue's
@max_active is set to 1 so that destruction work items are executed
one by one on each CPU.
Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
cgroup_destroy_wq can't be allocated from cgroup_init().  Do it from a
separate core_initcall().  In the future, we probably want to reorder
so that workqueue init happens before cgroup_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Shawn Bohrer <shawn.bohrer@gmail.com>
Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
Cc: stable@vger.kernel.org # v3.9+
											 
										 
										
											2013-11-22 17:14:39 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup  destruction  makes  heavy  use  of  work  items  and  there  can  be  a  lot 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  of  concurrent  destructions .   Use  a  separate  workqueue  so  that  cgroup 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  destruction  work  items  don ' t  end  up  filling  up  max_active  of  system_wq 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  which  may  lead  to  deadlock . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  struct  workqueue_struct  * cgroup_destroy_wq ;  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  pidlist  destructions  need  to  be  flushed  on  cgroup  destruction .   Use  a 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  separate  workqueue  as  flush  domain . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  struct  workqueue_struct  * cgroup_pidlist_destroy_wq ;  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/* generate an array of cgroup subsystem pointers */  
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys, 
  
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  struct  cgroup_subsys  * cgroup_subsys [ ]  =  {  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								# include  <linux/cgroup_subsys.h> 
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								} ;  
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# undef SUBSYS 
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								/* array of cgroup subsystem names */  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# define SUBSYS(_x) [_x ## _cgrp_id] = #_x, 
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  const  char  * cgroup_subsys_name [ ]  =  {  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								# include  <linux/cgroup_subsys.h> 
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								} ;  
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# undef SUBSYS 
  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  The  default  hierarchy ,  reserved  for  the  subsystems  that  are  otherwise 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-24 15:21:47 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  unattached  -  it  never  has  more  than  a  single  cgroup ,  and  all  tasks  are 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  part  of  that  cgroup . 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:55 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								struct  cgroup_root  cgrp_dfl_root ;  
						 
					
						
							
								
									
										
										
										
											2013-06-24 15:21:47 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:55 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  The  default  hierarchy  always  exists  but  is  hidden  until  mounted  for  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  first  time .   This  is  for  backward  compatibility . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  bool  cgrp_dfl_root_visible ;  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 19:33:07 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/* some controllers are not supported in the default hierarchy */  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  const  unsigned  int  cgrp_dfl_root_inhibit_ss_mask  =  0  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# ifdef CONFIG_CGROUP_DEBUG 
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									|  ( 1  < <  debug_cgrp_id ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# endif 
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								/* The list of hierarchy roots */  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-24 15:21:47 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  LIST_HEAD ( cgroup_roots ) ;  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  int  cgroup_root_count ;  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:37:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/* hierarchy ID allocation and mapping, protected by cgroup_mutex */  
						 
					
						
							
								
									
										
										
										
											2013-04-14 11:36:58 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  DEFINE_IDR ( cgroup_hierarchy_idr ) ;  
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:23 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-18 18:53:53 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:49 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  Assign  a  monotonically  increasing  serial  number  to  csses .   It  guarantees 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroups  with  bigger  numbers  are  newer  than  those  with  smaller  numbers . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Also ,  as  csses  are  always  appended  to  the  parent ' s  - > children  list ,  it 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  guarantees  that  sibling  csses  are  always  sorted  in  the  ascending  serial 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  number  order  on  the  list .   Protected  by  cgroup_mutex . 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-18 18:53:53 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:49 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  u64  css_serial_nr_next  =  1 ;  
						 
					
						
							
								
									
										
										
										
											2013-06-18 18:53:53 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								/* This flag indicates whether tasks in the fork and exit paths should
  
						 
					
						
							
								
									
										
										
										
											2008-02-23 15:24:09 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  check  for  fork / exit  handlers  to  call .  This  avoids  us  having  to  do 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  extra  work  in  the  fork / exit  path  if  none  of  the  subsystems  need  to 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  be  called . 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2008-07-25 01:46:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  need_forkexit_callback  __read_mostly ;  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-28 16:24:11 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  struct  cftype  cgroup_base_files [ ] ;  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  cgroup_put ( struct  cgroup  * cgrp ) ;  
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  rebind_subsystems ( struct  cgroup_root  * dst_root ,  
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											     unsigned  int  ss_mask ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:37 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_destroy_locked ( struct  cgroup  * cgrp ) ;  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  create_css ( struct  cgroup  * cgrp ,  struct  cgroup_subsys  * ss ) ;  
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  css_release ( struct  percpu_ref  * ref ) ;  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  kill_css ( struct  cgroup_subsys_state  * css ) ;  
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_addrm_files ( struct  cgroup  * cgrp ,  struct  cftype  cfts [ ] ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											      bool  is_add ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  cgroup_pidlist_destroy_all ( struct  cgroup  * cgrp ) ;  
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:37 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/* IDR wrappers which synchronize using cgroup_idr_lock */  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  int  cgroup_idr_alloc ( struct  idr  * idr ,  void  * ptr ,  int  start ,  int  end ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											    gfp_t  gfp_mask ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									int  ret ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									idr_preload ( gfp_mask ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:10:59 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									spin_lock_bh ( & cgroup_idr_lock ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									ret  =  idr_alloc ( idr ,  ptr ,  start ,  end ,  gfp_mask ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:10:59 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									spin_unlock_bh ( & cgroup_idr_lock ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									idr_preload_end ( ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  ret ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  void  * cgroup_idr_replace ( struct  idr  * idr ,  void  * ptr ,  int  id )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									void  * ret ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:10:59 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									spin_lock_bh ( & cgroup_idr_lock ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									ret  =  idr_replace ( idr ,  ptr ,  id ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:10:59 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									spin_unlock_bh ( & cgroup_idr_lock ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  ret ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  void  cgroup_idr_remove ( struct  idr  * idr ,  int  id )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:10:59 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									spin_lock_bh ( & cgroup_idr_lock ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									idr_remove ( idr ,  id ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:10:59 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									spin_unlock_bh ( & cgroup_idr_lock ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  struct  cgroup  * cgroup_parent ( struct  cgroup  * cgrp )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * parent_css  =  cgrp - > self . parent ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( parent_css ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  container_of ( parent_css ,  struct  cgroup ,  self ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  NULL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:27 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_css  -  obtain  a  cgroup ' s  css  for  the  specified  subsystem 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ cgrp :  the  cgroup  of  interest 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:00 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ ss :  the  subsystem  of  interest  ( % NULL  returns  @ cgrp - > self ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:27 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-26 18:40:56 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  Return  @ cgrp ' s  css  ( cgroup_subsys_state )  associated  with  @ ss .   This 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  function  must  be  called  either  under  cgroup_mutex  or  rcu_read_lock ( )  and 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  the  caller  is  responsible  for  pinning  the  returned  css  if  it  wants  to 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  keep  accessing  it  outside  the  said  locks .   This  function  may  return 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  % NULL  if  @ cgrp  doesn ' t  have  @ subsys_id  enabled . 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:27 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  struct  cgroup_subsys_state  * cgroup_css ( struct  cgroup  * cgrp ,  
						 
					
						
							
								
									
										
										
										
											2013-08-26 18:40:56 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
													      struct  cgroup_subsys  * ss ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:27 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-08-26 18:40:56 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ss ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  rcu_dereference_check ( cgrp - > subsys [ ss - > id ] , 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce cgroup_tree_mutex
Currently cgroup uses combination of inode->i_mutex'es and
cgroup_mutex for synchronization.  With the scheduled kernfs
conversion, i_mutex'es will be removed.  Unfortunately, just using
cgroup_mutex isn't possible.  All kernfs file and syscall operations,
most of which require grabbing cgroup_mutex, will be called with
kernfs active ref held and, if we try to perform kernfs removals under
cgroup_mutex, it can deadlock as kernfs_remove() tries to drain the
target node.
Let's introduce a new outer mutex, cgroup_tree_mutex, which protects
stuff used during hierarchy changing operations - cftypes and all the
operations which may affect the cgroupfs.  It also covers css
association and iteration.  This allows cgroup_css(), for_each_css()
and other css iterators to be called under cgroup_tree_mutex.  The new
mutex will nest above both kernfs's active ref protection and
cgroup_mutex.  By protecting tree modifications with a separate outer
mutex, we can get rid of the forementioned deadlock condition.
Actual file additions and removals now require cgroup_tree_mutex
instead of cgroup_mutex.  Currently, cgroup_tree_mutex is never used
without cgroup_mutex; however, we'll soon add hierarchy modification
sections which are only protected by cgroup_tree_mutex.  In the
future, we might want to make the locking more granular by better
splitting the coverages of the two mutexes.  For now, this should do.
v2: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-11 11:52:47 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
													lockdep_is_held ( & cgroup_mutex ) ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-26 18:40:56 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									else 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:00 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  & cgrp - > self ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:27 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:37 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce effective cgroup_subsys_state
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css.  This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups.  compare_css_sets()
already compares both, not for correctness but for optimization.  As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_e_css  -  obtain  a  cgroup ' s  effective  css  for  the  specified  subsystem 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ cgrp :  the  cgroup  of  interest 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:00 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ ss :  the  subsystem  of  interest  ( % NULL  returns  @ cgrp - > self ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce effective cgroup_subsys_state
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css.  This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups.  compare_css_sets()
already compares both, not for correctness but for optimization.  As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Similar  to  cgroup_css ( )  but  returns  the  effctive  css ,  which  is  defined 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  as  the  matching  css  of  the  nearest  ancestor  including  self  which  has  @ ss 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  enabled .   If  @ ss  is  associated  with  the  hierarchy  @ cgrp  is  on ,  this 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  function  is  guaranteed  to  return  non - NULL  css . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  struct  cgroup_subsys_state  * cgroup_e_css ( struct  cgroup  * cgrp ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
														struct  cgroup_subsys  * ss ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ! ss ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:00 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  & cgrp - > self ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce effective cgroup_subsys_state
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css.  This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups.  compare_css_sets()
already compares both, not for correctness but for optimization.  As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ! ( cgrp - > root - > subsys_mask  &  ( 1  < <  ss - > id ) ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  NULL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									while  ( cgroup_parent ( cgrp )  & & 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									       ! ( cgroup_parent ( cgrp ) - > child_subsys_mask  &  ( 1  < <  ss - > id ) ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgrp  =  cgroup_parent ( cgrp ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce effective cgroup_subsys_state
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css.  This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups.  compare_css_sets()
already compares both, not for correctness but for optimization.  As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  cgroup_css ( cgrp ,  ss ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:27 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:37 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								/* convenient tests for these bits */  
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:53 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  inline  bool  cgroup_is_dead ( const  struct  cgroup  * cgrp )  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  ! ( cgrp - > self . flags  &  CSS_ONLINE ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:21 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								struct  cgroup_subsys_state  * of_css ( struct  kernfs_open_file  * of )  
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * cgrp  =  of - > kn - > parent - > priv ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:21 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cftype  * cft  =  of_cft ( of ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  This  is  open  and  unprotected  implementation  of  cgroup_css ( ) . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  seq_css ( )  is  only  called  from  a  kernfs  file  operation  which  has 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  an  active  reference  on  the  file .   Because  all  the  subsystem 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  files  are  drained  before  a  css  is  disassociated  with  a  cgroup , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  the  matching  css  from  the  cgroup ' s  subsys  table  is  guaranteed  to 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  be  and  stay  valid  until  the  enclosing  operation  is  complete . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( cft - > ss ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  rcu_dereference_raw ( cgrp - > subsys [ cft - > ss - > id ] ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									else 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:00 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  & cgrp - > self ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:21 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								EXPORT_SYMBOL_GPL ( of_css ) ;  
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-04-08 19:00:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_is_descendant  -  test  ancestry 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ cgrp :  the  cgroup  to  be  tested 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ ancestor :  possible  ancestor  of  @ cgrp 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Test  whether  @ cgrp  is  a  descendant  of  @ ancestor .   It  also  returns  % true 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  if  @ cgrp  = =  @ ancestor .   This  function  is  safe  to  call  as  long  as  @ cgrp 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  and  @ ancestor  are  accessible . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								bool  cgroup_is_descendant ( struct  cgroup  * cgrp ,  struct  cgroup  * ancestor )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									while  ( cgrp )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( cgrp  = =  ancestor ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											return  true ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										cgrp  =  cgroup_parent ( cgrp ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-08 19:00:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  false ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2008-02-07 00:13:46 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_is_releasable ( const  struct  cgroup  * cgrp )  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									const  int  bits  = 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:40:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										( 1  < <  CGRP_RELEASABLE )  | 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										( 1  < <  CGRP_NOTIFY_ON_RELEASE ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  ( cgrp - > flags  &  bits )  = =  bits ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2008-02-07 00:13:46 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  notify_on_release ( const  struct  cgroup  * cgrp )  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:40:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  test_bit ( CGRP_NOTIFY_ON_RELEASE ,  & cgrp - > flags ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  for_each_css  -  iterate  all  css ' s  of  a  cgroup 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ css :  the  iteration  cursor 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ ssid :  the  index  of  the  subsystem ,  CGROUP_SUBSYS_COUNT  after  reaching  the  end 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ cgrp :  the  target  cgroup  to  iterate  css ' s  of 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce effective cgroup_subsys_state
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css.  This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups.  compare_css_sets()
already compares both, not for correctness but for optimization.  As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  Should  be  called  under  cgroup_ [ tree_ ] mutex . 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# define for_each_css(css, ssid, cgrp)					\ 
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									for  ( ( ssid )  =  0 ;  ( ssid )  <  CGROUP_SUBSYS_COUNT ;  ( ssid ) + + ) 	\
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ! ( ( css )  =  rcu_dereference_check ( 			\
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												( cgrp ) - > subsys [ ( ssid ) ] , 			\
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												lockdep_is_held ( & cgroup_mutex ) ) ) )  {  } 	\
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										else 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce effective cgroup_subsys_state
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css.  This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups.  compare_css_sets()
already compares both, not for correctness but for optimization.  As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  for_each_e_css  -  iterate  all  effective  css ' s  of  a  cgroup 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ css :  the  iteration  cursor 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ ssid :  the  index  of  the  subsystem ,  CGROUP_SUBSYS_COUNT  after  reaching  the  end 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ cgrp :  the  target  cgroup  to  iterate  css ' s  of 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Should  be  called  under  cgroup_ [ tree_ ] mutex . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# define for_each_e_css(css, ssid, cgrp)					\ 
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									for  ( ( ssid )  =  0 ;  ( ssid )  <  CGROUP_SUBSYS_COUNT ;  ( ssid ) + + ) 	\
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ! ( ( css )  =  cgroup_e_css ( cgrp ,  cgroup_subsys [ ( ssid ) ] ) ) )  \
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											; 						\
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										else 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-25 11:53:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  for_each_subsys  -  iterate  all  enabled  cgroup  subsystems 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-25 11:53:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ ss :  the  iteration  cursor 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ ssid :  the  index  of  @ ss ,  CGROUP_SUBSYS_COUNT  after  reaching  the  end 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-25 11:53:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# define for_each_subsys(ss, ssid)					\ 
  
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for  ( ( ssid )  =  0 ;  ( ssid )  <  CGROUP_SUBSYS_COUNT  & & 		\
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									     ( ( ( ss )  =  cgroup_subsys [ ssid ] )  | |  true ) ;  ( ssid ) + + ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-25 11:53:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:53 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/* iterate across the hierarchies */  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# define for_each_root(root)						\ 
  
						 
					
						
							
								
									
										
										
										
											2013-06-24 15:21:48 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									list_for_each_entry ( ( root ) ,  & cgroup_roots ,  root_list ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/* iterate over child cgrps, lock should be held throughout iteration */  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# define cgroup_for_each_live_child(child, cgrp)				\ 
  
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									list_for_each_entry ( ( child ) ,  & ( cgrp ) - > self . children ,  self . sibling )  \
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ( {  lockdep_assert_held ( & cgroup_mutex ) ; 		\
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										       cgroup_is_dead ( child ) ;  } ) ) 			\
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											; 						\
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										else 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-07 09:29:51 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/* the list of cgroups eligible for automatic release. Protected by
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  release_list_lock  */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  LIST_HEAD ( release_list ) ;  
						 
					
						
							
								
									
										
										
										
											2009-07-25 16:47:45 +02:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  DEFINE_RAW_SPINLOCK ( release_list_lock ) ;  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  cgroup_release_agent ( struct  work_struct  * work ) ;  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  DECLARE_WORK ( release_agent_work ,  cgroup_release_agent ) ;  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:40:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  check_for_release ( struct  cgroup  * cgrp ) ;  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  A  cgroup  can  be  associated  with  multiple  css_sets  as  different  tasks  may 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  belong  to  different  cgroups  on  different  hierarchies .   In  the  other 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  direction ,  a  css_set  is  naturally  associated  with  multiple  cgroups . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  This  M : N  relationship  is  represented  by  the  following  link  structure 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  which  exists  for  each  association  and  allows  traversing  the  associations 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  from  both  sides . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								struct  cgrp_cset_link  {  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* the cgroup and css_set this link associates */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgroup 		* cgrp ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  css_set 		* cset ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* list of cgrp_cset_links anchored at cgrp->cset_links */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  list_head 	cset_link ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* list of cgrp_cset_links anchored at css_set->cgrp_links */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  list_head 	cgrp_link ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								} ;  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:53 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  The  default  css_set  -  used  by  init  and  its  children  prior  to  any 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  hierarchies  being  mounted .  It  contains  a  pointer  to  the  root  state 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  for  each  subsystem .  Also  used  to  anchor  the  list  of  css_sets .  Not 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  reference - counted ,  to  improve  performance  when  child  cgroups 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  haven ' t  been  created . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-07 21:31:17 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								struct  css_set  init_css_set  =  {  
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:53 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									. refcount 		=  ATOMIC_INIT ( 1 ) , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									. cgrp_links 		=  LIST_HEAD_INIT ( init_css_set . cgrp_links ) , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									. tasks 			=  LIST_HEAD_INIT ( init_css_set . tasks ) , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									. mg_tasks 		=  LIST_HEAD_INIT ( init_css_set . mg_tasks ) , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									. mg_preload_node 	=  LIST_HEAD_INIT ( init_css_set . mg_preload_node ) , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									. mg_node 		=  LIST_HEAD_INIT ( init_css_set . mg_node ) , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								} ;  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:53 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  css_set_count 	=  1 ; 	/* 1 for init_css_set */  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-04-25 18:28:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_update_populated  -  updated  populated  count  of  a  cgroup 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ cgrp :  the  target  cgroup 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ populated :  inc  or  dec  populated  count 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ cgrp  is  either  getting  the  first  task  ( css_set )  or  losing  the  last . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Update  @ cgrp - > populated_cnt  accordingly .   The  count  is  propagated 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  towards  root  so  that  a  given  cgroup ' s  populated_cnt  is  zero  iff  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup  and  all  its  descendants  are  empty . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ cgrp ' s  interface  file  " cgroup.populated "  is  zero  if 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ cgrp - > populated_cnt  is  zero  and  1  otherwise .   When  @ cgrp - > populated_cnt 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  changes  from  or  to  zero ,  userland  is  notified  that  the  content  of  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  interface  file  has  changed .   This  can  be  used  to  detect  when  @ cgrp  and 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  its  descendants  become  populated  or  empty . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  void  cgroup_update_populated ( struct  cgroup  * cgrp ,  bool  populated )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									lockdep_assert_held ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									do  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										bool  trigger ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( populated ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											trigger  =  ! cgrp - > populated_cnt + + ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										else 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											trigger  =  ! - - cgrp - > populated_cnt ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ! trigger ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											break ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( cgrp - > populated_kn ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											kernfs_notify ( cgrp - > populated_kn ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										cgrp  =  cgroup_parent ( cgrp ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-25 18:28:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									}  while  ( cgrp ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  hash  table  for  cgroup  groups .  This  improves  the  performance  to  find 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  an  existing  css_set .  This  hash  doesn ' t  ( currently )  take  into 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  account  cgroups  in  empty  hierarchies . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2008-04-29 01:00:11 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# define CSS_SET_HASH_BITS	7 
  
						 
					
						
							
								
									
										
										
										
											2013-01-10 11:49:27 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  DEFINE_HASHTABLE ( css_set_table ,  CSS_SET_HASH_BITS ) ;  
						 
					
						
							
								
									
										
										
										
											2008-04-29 01:00:11 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-01-10 11:49:27 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  unsigned  long  css_set_hash ( struct  cgroup_subsys_state  * css [ ] )  
						 
					
						
							
								
									
										
										
										
											2008-04-29 01:00:11 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-01-10 11:49:27 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									unsigned  long  key  =  0UL ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-25 11:53:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys  * ss ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									int  i ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-04-29 01:00:11 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-25 11:53:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for_each_subsys ( ss ,  i ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-01-10 11:49:27 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										key  + =  ( unsigned  long ) css [ i ] ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									key  =  ( key  > >  16 )  ^  key ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-04-29 01:00:11 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-01-10 11:49:27 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  key ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-04-29 01:00:11 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  put_css_set_locked ( struct  css_set  * cset ,  bool  taskexit )  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:33 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgrp_cset_link  * link ,  * tmp_link ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys  * ss ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									int  ssid ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									lockdep_assert_held ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ! atomic_dec_and_test ( & cset - > refcount ) ) 
							 
						 
					
						
							
								
									
										
										
										
											2008-10-18 20:28:03 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:23 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* This css_set is dead. unlink it and release cgroup refcounts */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for_each_subsys ( ss ,  ssid ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										list_del ( & cset - > e_cset_node [ ssid ] ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									hash_del ( & cset - > hlist ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:23 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									css_set_count - - ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									list_for_each_entry_safe ( link ,  tmp_link ,  & cset - > cgrp_links ,  cgrp_link )  { 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:23 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										struct  cgroup  * cgrp  =  link - > cgrp ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										list_del ( & link - > cset_link ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										list_del ( & link - > cgrp_link ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-01-24 14:43:28 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										/* @cgrp can't go away while we're holding css_set_rwsem */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-25 18:28:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( list_empty ( & cgrp - > cset_links ) )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											cgroup_update_populated ( cgrp ,  false ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( notify_on_release ( cgrp ) )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												if  ( taskexit ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
													set_bit ( CGRP_RELEASABLE ,  & cgrp - > flags ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												check_for_release ( cgrp ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											} 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:23 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										kfree ( link ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:23 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									kfree_rcu ( cset ,  rcu_head ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:33 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  put_css_set ( struct  css_set  * cset ,  bool  taskexit )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Ensure  that  the  refcount  doesn ' t  hit  zero  while  any  readers 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  can  see  it .  Similar  to  atomic_dec_and_lock ( ) ,  but  for  an 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  rwlock 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( atomic_add_unless ( & cset - > refcount ,  - 1 ,  1 ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									down_write ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									put_css_set_locked ( cset ,  taskexit ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									up_write ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  refcounted  get / put  for  css_set  objects 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  inline  void  get_css_set ( struct  css_set  * cset )  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									atomic_inc ( & cset - > refcount ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-24 15:21:48 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  compare_css_sets  -  helper  function  for  find_existing_css_set ( ) . 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ cset :  candidate  css_set  being  tested 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ old_cset :  existing  css_set  for  a  task 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ new_cgrp :  cgroup  that ' s  being  entered  by  the  task 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ template :  desired  set  of  css  pointers  in  css_set  ( pre - calculated ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
										
										
											2013-07-31 16:18:36 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  Returns  true  if  " cset "  matches  " old_cset "  except  for  the  hierarchy 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  which  " new_cgrp "  belongs  to ,  for  which  it  should  match  " new_cgrp " . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  bool  compare_css_sets ( struct  css_set  * cset ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											     struct  css_set  * old_cset , 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											     struct  cgroup  * new_cgrp , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											     struct  cgroup_subsys_state  * template [ ] ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  list_head  * l1 ,  * l2 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce effective cgroup_subsys_state
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css.  This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups.  compare_css_sets()
already compares both, not for correctness but for optimization.  As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  On  the  default  hierarchy ,  there  can  be  csets  which  are 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  associated  with  the  same  set  of  cgroups  but  different  csses . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Let ' s  first  ensure  that  csses  match . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( memcmp ( template ,  cset - > subsys ,  sizeof ( cset - > subsys ) ) ) 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  false ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Compare  cgroup  pointers  in  order  to  distinguish  between 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce effective cgroup_subsys_state
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css.  This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups.  compare_css_sets()
already compares both, not for correctness but for optimization.  As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  different  cgroups  in  hierarchies .   As  different  cgroups  may 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  share  the  same  effective  css ,  this  comparison  is  always 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  necessary . 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									l1  =  & cset - > cgrp_links ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									l2  =  & old_cset - > cgrp_links ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									while  ( 1 )  { 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										struct  cgrp_cset_link  * link1 ,  * link2 ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										struct  cgroup  * cgrp1 ,  * cgrp2 ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										l1  =  l1 - > next ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										l2  =  l2 - > next ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										/* See if we reached the end - both lists are equal length. */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( l1  = =  & cset - > cgrp_links )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											BUG_ON ( l2  ! =  & old_cset - > cgrp_links ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											break ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										}  else  { 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											BUG_ON ( l2  = =  & old_cset - > cgrp_links ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										/* Locate the cgroups associated with these links. */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										link1  =  list_entry ( l1 ,  struct  cgrp_cset_link ,  cgrp_link ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										link2  =  list_entry ( l2 ,  struct  cgrp_cset_link ,  cgrp_link ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgrp1  =  link1 - > cgrp ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgrp2  =  link2 - > cgrp ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										/* Hierarchies should be linked in the same order. */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										BUG_ON ( cgrp1 - > root  ! =  cgrp2 - > root ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  If  this  hierarchy  is  the  hierarchy  of  the  cgroup 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  that ' s  changing ,  then  we  need  to  check  that  this 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  css_set  points  to  the  new  cgroup ;  if  it ' s  any  other 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  hierarchy ,  then  this  css_set  should  point  to  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  same  cgroup  as  the  old  css_set . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( cgrp1 - > root  = =  new_cgrp - > root )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( cgrp1  ! =  new_cgrp ) 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												return  false ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										}  else  { 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											if  ( cgrp1  ! =  cgrp2 ) 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												return  false ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  true ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-24 15:21:48 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  find_existing_css_set  -  init  css  array  and  find  the  matching  css_set 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ old_cset :  the  css_set  that  we ' re  using  before  the  cgroup  transition 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ cgrp :  the  cgroup  that  we ' re  moving  into 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ template :  out  param  for  the  new  set  of  csses ,  should  be  clear  on  entry 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  struct  css_set  * find_existing_css_set ( struct  css_set  * old_cset ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
													struct  cgroup  * cgrp , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
													struct  cgroup_subsys_state  * template [ ] ) 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:33 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_root  * root  =  cgrp - > root ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-25 11:53:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys  * ss ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  css_set  * cset ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-01-10 11:49:27 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									unsigned  long  key ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-24 15:21:48 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  i ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroups: revamp subsys array
This patch series provides the ability for cgroup subsystems to be
compiled as modules both within and outside the kernel tree.  This is
mainly useful for classifiers and subsystems that hook into components
that are already modules.  cls_cgroup and blkio-cgroup serve as the
example use cases for this feature.
It provides an interface cgroup_load_subsys() and cgroup_unload_subsys()
which modular subsystems can use to register and depart during runtime.
The net_cls classifier subsystem serves as the example for a subsystem
which can be converted into a module using these changes.
Patch #1 sets up the subsys[] array so its contents can be dynamic as
modules appear and (eventually) disappear.  Iterations over the array are
modified to handle when subsystems are absent, and the dynamic section of
the array is protected by cgroup_mutex.
Patch #2 implements an interface for modules to load subsystems, called
cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module
pointer in struct cgroup_subsys.
Patch #3 adds a mechanism for unloading modular subsystems, which includes
a more advanced rework of the rudimentary reference counting introduced in
patch 2.
Patch #4 modifies the net_cls subsystem, which already had some module
declarations, to be configurable as a module, which also serves as a
simple proof-of-concept.
Part of implementing patches 2 and 4 involved updating css pointers in
each css_set when the module appears or leaves.  In doing this, it was
discovered that css_sets always remain linked to the dummy cgroup,
regardless of whether or not any subsystems are actually bound to it
(i.e., not mounted on an actual hierarchy).  The subsystem loading and
unloading code therefore should keep in mind the special cases where the
added subsystem is the only one in the dummy cgroup (and therefore all
css_sets need to be linked back into it) and where the removed subsys was
the only one in the dummy cgroup (and therefore all css_sets should be
unlinked from it) - however, as all css_sets always stay attached to the
dummy cgroup anyway, these cases are ignored.  Any fix that addresses this
issue should also make sure these cases are addressed in the subsystem
loading and unloading code.
This patch:
Make subsys[] able to be dynamically populated to support modular
subsystems
This patch reworks the way the subsys[] array is used so that subsystems
can register themselves after boot time, and enables the internals of
cgroups to be able to handle when subsystems are not present or may
appear/disappear.
Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2010-03-10 15:22:07 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Build  the  set  of  subsystem  state  objects  that  we  want  to  see  in  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  new  css_set .  while  subsystems  can  change  globally ,  the  entries  here 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  won ' t  change ,  so  no  need  for  locking . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-25 11:53:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for_each_subsys ( ss ,  i )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( root - > subsys_mask  &  ( 1UL  < <  i ) )  { 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce effective cgroup_subsys_state
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css.  This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups.  compare_css_sets()
already compares both, not for correctness but for optimization.  As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 *  @ ss  is  in  this  hierarchy ,  so  we  want  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 *  effective  css  from  @ cgrp . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											template [ i ]  =  cgroup_e_css ( cgrp ,  ss ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										}  else  { 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce effective cgroup_subsys_state
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css.  This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups.  compare_css_sets()
already compares both, not for correctness but for optimization.  As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 *  @ ss  is  not  in  this  hierarchy ,  so  we  don ' t  want 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 *  to  change  the  css . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											template [ i ]  =  old_cset - > subsys [ i ] ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-01-10 11:49:27 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									key  =  css_set_hash ( template ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									hash_for_each_possible ( css_set_table ,  cset ,  hlist ,  key )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ! compare_css_sets ( cset ,  old_cset ,  cgrp ,  template ) ) 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										/* This css_set matches what we need */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  cset ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-04-29 01:00:11 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* No existing cgroup group matched */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  NULL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  free_cgrp_cset_links ( struct  list_head  * links_to_free )  
						 
					
						
							
								
									
										
										
										
											2008-07-29 22:33:19 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgrp_cset_link  * link ,  * tmp_link ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-07-29 22:33:19 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									list_for_each_entry_safe ( link ,  tmp_link ,  links_to_free ,  cset_link )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										list_del ( & link - > cset_link ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-07-29 22:33:19 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										kfree ( link ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  allocate_cgrp_cset_links  -  allocate  cgrp_cset_links 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ count :  the  number  of  links  to  allocate 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ tmp_links :  list_head  the  allocated  links  are  put  on 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Allocate  @ count  cgrp_cset_link  structures  and  chain  them  on  @ tmp_links 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  through  - > cset_link .   Returns  0  on  success  or  - errno . 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  allocate_cgrp_cset_links ( int  count ,  struct  list_head  * tmp_links )  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgrp_cset_link  * link ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  i ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									INIT_LIST_HEAD ( tmp_links ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for  ( i  =  0 ;  i  <  count ;  i + + )  { 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:51 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										link  =  kzalloc ( sizeof ( * link ) ,  GFP_KERNEL ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ! link )  { 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											free_cgrp_cset_links ( tmp_links ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											return  - ENOMEM ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										list_add ( & link - > cset_link ,  tmp_links ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-01-07 18:07:42 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  link_css_set  -  a  helper  function  to  link  a  css_set  to  a  cgroup 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ tmp_links :  cgrp_cset_link  objects  allocated  by  allocate_cgrp_cset_links ( ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ cset :  the  css_set  to  be  linked 
							 
						 
					
						
							
								
									
										
										
										
											2009-01-07 18:07:42 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ cgrp :  the  destination  cgroup 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  link_css_set ( struct  list_head  * tmp_links ,  struct  css_set  * cset ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 struct  cgroup  * cgrp ) 
							 
						 
					
						
							
								
									
										
										
										
											2009-01-07 18:07:42 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgrp_cset_link  * link ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-01-07 18:07:42 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									BUG_ON ( list_empty ( tmp_links ) ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( cgroup_on_dfl ( cgrp ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cset - > dfl_cgrp  =  cgrp ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									link  =  list_first_entry ( tmp_links ,  struct  cgrp_cset_link ,  cset_link ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									link - > cset  =  cset ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									link - > cgrp  =  cgrp ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-25 18:28:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( list_empty ( & cgrp - > cset_links ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgroup_update_populated ( cgrp ,  true ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									list_move ( & link - > cset_link ,  & cgrp - > cset_links ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-25 18:28:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Always  add  links  to  the  tail  of  the  list  so  that  the  list 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  is  sorted  by  order  of  hierarchy  creation 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									list_add_tail ( & link - > cgrp_link ,  & cset - > cgrp_links ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-01-07 18:07:42 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-24 15:21:48 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  find_css_set  -  return  a  new  css_set  with  one  cgroup  updated 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ old_cset :  the  baseline  css_set 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ cgrp :  the  cgroup  to  be  updated 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Return  a  new  css_set  that ' s  equivalent  to  @ old_cset ,  but  with  @ cgrp 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  substituted  into  the  appropriate  hierarchy . 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  struct  css_set  * find_css_set ( struct  css_set  * old_cset ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												    struct  cgroup  * cgrp ) 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-06-24 15:21:48 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * template [ CGROUP_SUBSYS_COUNT ]  =  {  } ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  css_set  * cset ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  list_head  tmp_links ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgrp_cset_link  * link ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys  * ss ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-01-10 11:49:27 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									unsigned  long  key ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  ssid ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-04-29 01:00:11 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-24 15:21:48 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* First see if we already have a cgroup group that matches
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  the  desired  set  */ 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									down_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cset  =  find_existing_css_set ( old_cset ,  cgrp ,  template ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( cset ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										get_css_set ( cset ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									up_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( cset ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  cset ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:51 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cset  =  kzalloc ( sizeof ( * cset ) ,  GFP_KERNEL ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ! cset ) 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  NULL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* Allocate all the cgrp_cset_link objects that we'll need */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-24 15:21:47 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( allocate_cgrp_cset_links ( cgroup_root_count ,  & tmp_links )  <  0 )  { 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										kfree ( cset ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  NULL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									atomic_set ( & cset - > refcount ,  1 ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									INIT_LIST_HEAD ( & cset - > cgrp_links ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									INIT_LIST_HEAD ( & cset - > tasks ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add css_set->mg_tasks
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, we're going to use task->cg_list during
migration too.  Instead of building a separate array, target tasks
will be linked into a dedicated migration list_head on the owning
css_set.  Tasks on the migration list are treated the same as tasks on
the usual tasks list; however, being on a separate list allows cgroup
migration code path to keep track of the target tasks by simply
keeping the list of css_sets with tasks being migrated, making
unpredictable dynamic allocation unnecessary.
In prepartion of such migration path update, this patch introduces
css_set->mg_tasks list and updates css_set task iterations so that
they walk both css_set->tasks and ->mg_tasks.  Note that ->mg_tasks
isn't used yet.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									INIT_LIST_HEAD ( & cset - > mg_tasks ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks.  One problem with this approach is migration
of multiple targets.  It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks().  Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration.  If preparations of
all targets succeed, the whole thing will succeed.  If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets.  cgroup_migrate_prepare()
pins all source and target css_sets and link them up.  Note that this
can be performed without holding threadgroup_lock even if the target
is a process.  As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									INIT_LIST_HEAD ( & cset - > mg_preload_node ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks.  The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration.  This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists.  cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor.  cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									INIT_LIST_HEAD ( & cset - > mg_node ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									INIT_HLIST_NODE ( & cset - > hlist ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* Copy the set of subsystem state objects generated in
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  find_existing_css_set ( )  */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									memcpy ( cset - > subsys ,  template ,  sizeof ( cset - > subsys ) ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									down_write ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* Add reference counts and links from the new css_set. */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									list_for_each_entry ( link ,  & old_cset - > cgrp_links ,  cgrp_link )  { 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										struct  cgroup  * c  =  link - > cgrp ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( c - > root  = =  cgrp - > root ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											c  =  cgrp ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										link_css_set ( & tmp_links ,  cset ,  c ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									BUG_ON ( ! list_empty ( & tmp_links ) ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									css_set_count + + ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-04-29 01:00:11 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* Add @cset to the hash table */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									key  =  css_set_hash ( cset - > subsys ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									hash_add ( css_set_table ,  & cset - > hlist ,  key ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-04-29 01:00:11 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for_each_subsys ( ss ,  ssid ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										list_add_tail ( & cset - > e_cset_node [ ssid ] , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											      & cset - > subsys [ ssid ] - > cgroup - > e_csets [ ssid ] ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									up_write ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  cset ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:33 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  struct  cgroup_root  * cgroup_root_from_kf ( struct  kernfs_root  * kf_root )  
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * root_cgrp  =  kf_root - > kn - > priv ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  root_cgrp - > root ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_init_root_id ( struct  cgroup_root  * root )  
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									int  id ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:53 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									id  =  idr_alloc_cyclic ( & cgroup_hierarchy_idr ,  root ,  0 ,  0 ,  GFP_KERNEL ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( id  <  0 ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  id ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									root - > hierarchy_id  =  id ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  cgroup_exit_root_id ( struct  cgroup_root  * root )  
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( root - > hierarchy_id )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										idr_remove ( & cgroup_hierarchy_idr ,  root - > hierarchy_id ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										root - > hierarchy_id  =  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  cgroup_free_root ( struct  cgroup_root  * root )  
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( root )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										/* hierarhcy ID shoulid already have been released */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										WARN_ON_ONCE ( root - > hierarchy_id ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										idr_destroy ( & root - > cgroup_idr ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										kfree ( root ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  cgroup_destroy_root ( struct  cgroup_root  * root )  
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * cgrp  =  & root - > cgrp ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgrp_cset_link  * link ,  * tmp_link ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_lock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									BUG_ON ( atomic_read ( & root - > nr_cgrps ) ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									BUG_ON ( ! list_empty ( & cgrp - > self . children ) ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* Rebind all subsystems back to the default hierarchy */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									rebind_subsystems ( & cgrp_dfl_root ,  root - > subsys_mask ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  Release  all  the  links  from  cset_links  to  this  hierarchy ' s 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  root  cgroup 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									down_write ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									list_for_each_entry_safe ( link ,  tmp_link ,  & cgrp - > cset_links ,  cset_link )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										list_del ( & link - > cset_link ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										list_del ( & link - > cgrp_link ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										kfree ( link ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									up_write ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ! list_empty ( & root - > root_list ) )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										list_del ( & root - > root_list ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgroup_root_count - - ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									cgroup_exit_root_id ( root ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									mutex_unlock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									kernfs_destroy_root ( root - > kf_root ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgroup_free_root ( root ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:02 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/* look up cgroup associated with given css_set on the specified hierarchy */  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  struct  cgroup  * cset_cgroup_from_root ( struct  css_set  * cset ,  
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
													    struct  cgroup_root  * root ) 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgroup  * res  =  NULL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									lockdep_assert_held ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( cset  = =  & init_css_set )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										res  =  & root - > cgrp ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									}  else  { 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										struct  cgrp_cset_link  * link ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										list_for_each_entry ( link ,  & cset - > cgrp_links ,  cgrp_link )  { 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											struct  cgroup  * c  =  link - > cgrp ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											if  ( c - > root  = =  root )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												res  =  c ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												break ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									BUG_ON ( ! res ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  res ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:02 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  Return  the  cgroup  for  " task "  from  the  given  hierarchy .  Must  be 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  called  with  cgroup_mutex  and  css_set_rwsem  held . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  struct  cgroup  * task_cgroup_from_root ( struct  task_struct  * task ,  
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
													    struct  cgroup_root  * root ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:02 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  No  need  to  lock  the  task  -  since  we  hold  cgroup_mutex  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  task  can ' t  change  groups ,  so  the  only  thing  that  can  happen 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  is  that  it  exits  and  its  css  is  set  back  to  init_css_set . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  cset_cgroup_from_root ( task_css_set ( task ) ,  root ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  A  task  must  hold  cgroup_mutex  to  modify  cgroups . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Any  task  can  increment  and  decrement  the  count  field  without  lock . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  So  in  general ,  code  holding  cgroup_mutex  can ' t  rely  on  the  count 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  field  not  changing .   However ,  if  the  count  goes  to  zero ,  then  only 
							 
						 
					
						
							
								
									
										
										
										
											2008-02-07 00:14:43 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  cgroup_attach_task ( )  can  increment  it  again .   Because  a  count  of  zero 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								 *  means  that  no  tasks  are  currently  attached ,  therefore  there  is  no 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  way  a  task  attached  to  that  cgroup  can  fork  ( the  other  way  to 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  increment  the  count ) .   So  code  holding  cgroup_mutex  can  safely 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  assume  that  if  the  count  is  zero ,  it  will  stay  zero .  Similarly ,  if 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  a  task  holds  cgroup_mutex  on  a  cgroup  with  zero  count ,  it 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  knows  that  the  cgroup  won ' t  be  removed ,  as  cgroup_rmdir ( ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  needs  that  mutex . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  The  fork  and  exit  callbacks  cgroup_fork ( )  and  cgroup_exit ( ) ,  don ' t 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  ( usually )  take  cgroup_mutex .   These  are  the  two  most  performance 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  critical  pieces  of  code  here .   The  exception  occurs  on  cgroup_exit ( ) , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  when  a  task  in  a  notify_on_release  cgroup  exits .   Then  cgroup_mutex 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  is  taken ,  and  if  the  cgroup  count  is  zero ,  a  usermode  call  made 
							 
						 
					
						
							
								
									
										
										
										
											2008-02-23 15:24:09 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  to  the  release  agent  with  the  name  of  the  cgroup  ( path  relative  to 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  the  root  of  cgroup  file  system )  as  the  argument . 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  A  cgroup  can  only  be  deleted  if  both  its  ' count '  of  using  tasks 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  is  zero ,  and  its  list  of  ' children '  cgroups  is  empty .   Since  all 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  tasks  in  the  system  use  _some_  cgroup ,  and  since  there  is  always  at 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  least  one  task  in  the  system  ( init ,  pid  = =  1 ) ,  therefore ,  root  cgroup 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								 *  always  has  either  children  cgroups  and / or  using  tasks .   So  we  don ' t 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  need  a  special  hack  to  ensure  that  root  cgroup  cannot  be  deleted . 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  P . S .   One  more  locking  exception .   RCU  is  used  to  guard  the 
							 
						 
					
						
							
								
									
										
										
										
											2008-02-07 00:14:43 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  update  of  a  tasks  cgroup  pointer  by  cgroup_attach_task ( ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_populate_dir ( struct  cgroup  * cgrp ,  unsigned  int  subsys_mask ) ;  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  struct  kernfs_syscall_ops  cgroup_kf_syscall_ops ;  
						 
					
						
							
								
									
										
										
										
											2009-10-01 15:43:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  const  struct  file_operations  proc_cgroupstats_operations ;  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  char  * cgroup_file_name ( struct  cgroup  * cgrp ,  const  struct  cftype  * cft ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											      char  * buf ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( cft - > ss  & &  ! ( cft - > flags  &  CFTYPE_NO_PREFIX )  & & 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									    ! ( cgrp - > root - > flags  &  CGRP_ROOT_NOPREFIX ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										snprintf ( buf ,  CGROUP_FILE_NAME_MAX ,  " %s.%s " , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 cft - > ss - > name ,  cft - > name ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									else 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										strncpy ( buf ,  cft - > name ,  CGROUP_FILE_NAME_MAX ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  buf ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_file_mode  -  deduce  file  mode  of  a  control  file 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ cft :  the  control  file  in  question 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  returns  cft - > mode  if  - > mode  is  not  0 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  returns  S_IRUGO | S_IWUSR  if  it  has  both  a  read  and  a  write  handler 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  returns  S_IRUGO  if  it  has  only  a  read  handler 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  returns  S_IWUSR  if  it  has  only  a  write  hander 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  umode_t  cgroup_file_mode ( const  struct  cftype  * cft )  
						 
					
						
							
								
									
										
										
										
											2013-03-01 15:01:56 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									umode_t  mode  =  0 ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-03-01 15:01:56 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( cft - > mode ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  cft - > mode ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( cft - > read_u64  | |  cft - > read_s64  | |  cft - > seq_show ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										mode  | =  S_IRUGO ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:21 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( cft - > write_u64  | |  cft - > write_s64  | |  cft - > write ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										mode  | =  S_IWUSR ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  mode ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-03-01 15:01:56 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  cgroup_get ( struct  cgroup  * cgrp )  
						 
					
						
							
								
									
										
										
										
											2013-01-24 14:31:42 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									WARN_ON_ONCE ( cgroup_is_dead ( cgrp ) ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									css_get ( & cgrp - > self ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-01-24 14:31:42 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  cgroup_put ( struct  cgroup  * cgrp )  
						 
					
						
							
								
									
										
										
										
											2013-01-24 14:31:42 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									css_put ( & cgrp - > self ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-01-24 14:31:42 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_kn_unlock  -  unlocking  helper  for  cgroup  kernfs  methods 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ kn :  the  kernfs_node  being  serviced 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  This  helper  undoes  cgroup_kn_lock_live ( )  and  should  be  invoked  before 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  the  method  finishes  if  locking  succeeded .   Note  that  once  this  function 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  returns  the  cgroup  returned  by  cgroup_kn_lock_live ( )  may  become 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  inaccessible  any  time .   If  the  caller  intends  to  continue  to  access  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup ,  it  should  pin  it  before  invoking  this  function . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  void  cgroup_kn_unlock ( struct  kernfs_node  * kn )  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * cgrp ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( kernfs_type ( kn )  = =  KERNFS_DIR ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgrp  =  kn - > priv ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									else 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgrp  =  kn - > parent - > priv ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									mutex_unlock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									kernfs_unbreak_active_protection ( kn ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									cgroup_put ( cgrp ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_kn_lock_live  -  locking  helper  for  cgroup  kernfs  methods 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ kn :  the  kernfs_node  being  serviced 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  This  helper  is  to  be  used  by  a  cgroup  kernfs  method  currently  servicing 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ kn .   It  breaks  the  active  protection ,  performs  cgroup  locking  and 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  verifies  that  the  associated  cgroup  is  alive .   Returns  the  cgroup  if 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  alive ;  otherwise ,  % NULL .   A  successful  return  should  be  undone  by  a 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  matching  cgroup_kn_unlock ( )  invocation . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Any  cgroup  kernfs  method  implementation  which  requires  locking  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  associated  cgroup  should  use  this  helper .   It  avoids  nesting  cgroup 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  locking  under  kernfs  active  protection  and  allows  all  kernfs  operations 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  including  self - removal . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  struct  cgroup  * cgroup_kn_lock_live ( struct  kernfs_node  * kn )  
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * cgrp ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( kernfs_type ( kn )  = =  KERNFS_DIR ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgrp  =  kn - > priv ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									else 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgrp  =  kn - > parent - > priv ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-01-21 18:18:33 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  We ' re  gonna  grab  cgroup_mutex  which  nests  outside  kernfs 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  active_ref .   cgroup  liveliness  check  alone  provides  enough 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  protection  against  removal .   Ensure  @ cgrp  stays  accessible  and 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  break  the  active_ref  protection . 
							 
						 
					
						
							
								
									
										
										
										
											2013-01-21 18:18:33 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgroup_get ( cgrp ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									kernfs_break_active_protection ( kn ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_lock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ! cgroup_is_dead ( cgrp ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  cgrp ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									cgroup_kn_unlock ( kn ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  NULL ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-01-21 18:18:33 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  cgroup_rm_file ( struct  cgroup  * cgrp ,  const  struct  cftype  * cft )  
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									char  name [ CGROUP_FILE_NAME_MAX ] ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									kernfs_remove_by_name ( cgrp - > kn ,  cgroup_file_name ( cgrp ,  cft ,  name ) ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2012-08-23 16:53:29 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
									
										
										
										
											2013-06-28 16:24:11 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  cgroup_clear_dir  -  remove  subsys  files  in  a  cgroup  directory 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-28 16:24:10 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ cgrp :  target  cgroup 
							 
						 
					
						
							
								
									
										
										
										
											2012-08-23 16:53:29 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ subsys_mask :  mask  of  the  subsystem  ids  whose  files  should  be  removed 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  cgroup_clear_dir ( struct  cgroup  * cgrp ,  unsigned  int  subsys_mask )  
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2012-08-23 16:53:29 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys  * ss ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-07-12 12:34:02 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  i ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-07-12 12:34:02 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for_each_subsys ( ss ,  i )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										struct  cftype  * cfts ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-07-12 12:34:02 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ! ( subsys_mask  &  ( 1  < <  i ) ) ) 
							 
						 
					
						
							
								
									
										
										
										
											2012-08-23 16:53:29 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										list_for_each_entry ( cfts ,  & ss - > cfts ,  node ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											cgroup_addrm_files ( cgrp ,  cfts ,  false ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-08-23 16:53:29 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  rebind_subsystems ( struct  cgroup_root  * dst_root ,  unsigned  int  ss_mask )  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-06-25 11:53:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys  * ss ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 19:33:07 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									unsigned  int  tmp_ss_mask ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  ssid ,  i ,  ret ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce cgroup_tree_mutex
Currently cgroup uses combination of inode->i_mutex'es and
cgroup_mutex for synchronization.  With the scheduled kernfs
conversion, i_mutex'es will be removed.  Unfortunately, just using
cgroup_mutex isn't possible.  All kernfs file and syscall operations,
most of which require grabbing cgroup_mutex, will be called with
kernfs active ref held and, if we try to perform kernfs removals under
cgroup_mutex, it can deadlock as kernfs_remove() tries to drain the
target node.
Let's introduce a new outer mutex, cgroup_tree_mutex, which protects
stuff used during hierarchy changing operations - cftypes and all the
operations which may affect the cgroupfs.  It also covers css
association and iteration.  This allows cgroup_css(), for_each_css()
and other css iterators to be called under cgroup_tree_mutex.  The new
mutex will nest above both kernfs's active ref protection and
cgroup_mutex.  By protecting tree modifications with a separate outer
mutex, we can get rid of the forementioned deadlock condition.
Actual file additions and removals now require cgroup_tree_mutex
instead of cgroup_mutex.  Currently, cgroup_tree_mutex is never used
without cgroup_mutex; however, we'll soon add hierarchy modification
sections which are only protected by cgroup_tree_mutex.  In the
future, we might want to make the locking more granular by better
splitting the coverages of the two mutexes.  For now, this should do.
v2: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-11 11:52:47 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for_each_subsys ( ss ,  ssid )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ! ( ss_mask  &  ( 1  < <  ssid ) ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroups: revamp subsys array
This patch series provides the ability for cgroup subsystems to be
compiled as modules both within and outside the kernel tree.  This is
mainly useful for classifiers and subsystems that hook into components
that are already modules.  cls_cgroup and blkio-cgroup serve as the
example use cases for this feature.
It provides an interface cgroup_load_subsys() and cgroup_unload_subsys()
which modular subsystems can use to register and depart during runtime.
The net_cls classifier subsystem serves as the example for a subsystem
which can be converted into a module using these changes.
Patch #1 sets up the subsys[] array so its contents can be dynamic as
modules appear and (eventually) disappear.  Iterations over the array are
modified to handle when subsystems are absent, and the dynamic section of
the array is protected by cgroup_mutex.
Patch #2 implements an interface for modules to load subsystems, called
cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module
pointer in struct cgroup_subsys.
Patch #3 adds a mechanism for unloading modular subsystems, which includes
a more advanced rework of the rudimentary reference counting introduced in
patch 2.
Patch #4 modifies the net_cls subsystem, which already had some module
declarations, to be configurable as a module, which also serves as a
simple proof-of-concept.
Part of implementing patches 2 and 4 involved updating css pointers in
each css_set when the module appears or leaves.  In doing this, it was
discovered that css_sets always remain linked to the dummy cgroup,
regardless of whether or not any subsystems are actually bound to it
(i.e., not mounted on an actual hierarchy).  The subsystem loading and
unloading code therefore should keep in mind the special cases where the
added subsystem is the only one in the dummy cgroup (and therefore all
css_sets need to be linked back into it) and where the removed subsys was
the only one in the dummy cgroup (and therefore all css_sets should be
unlinked from it) - however, as all css_sets always stay attached to the
dummy cgroup anyway, these cases are ignored.  Any fix that addresses this
issue should also make sure these cases are addressed in the subsystem
loading and unloading code.
This patch:
Make subsys[] able to be dynamically populated to support modular
subsystems
This patch reworks the way the subsys[] array is used so that subsystems
can register themselves after boot time, and enables the internals of
cgroups to be able to handle when subsystems are not present or may
appear/disappear.
Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2010-03-10 15:22:07 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										/* if @ss has non-root csses attached to it, can't move */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( css_next_child ( NULL ,  cgroup_css ( & ss - > root - > cgrp ,  ss ) ) ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											return  - EBUSY ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-07-12 13:38:17 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										/* can't move between two non-dummy roots either */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ss - > root  ! =  & cgrp_dfl_root  & &  dst_root  ! =  & cgrp_dfl_root ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											return  - EBUSY ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 19:33:07 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* skip creating root files on dfl_root for inhibited subsystems */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									tmp_ss_mask  =  ss_mask ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( dst_root  = =  & cgrp_dfl_root ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										tmp_ss_mask  & =  ~ cgrp_dfl_root_inhibit_ss_mask ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									ret  =  cgroup_populate_dir ( & dst_root - > cgrp ,  tmp_ss_mask ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:55 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ret )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( dst_root  ! =  & cgrp_dfl_root ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											return  ret ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:55 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  Rebinding  back  to  the  default  root  is  not  allowed  to 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  fail .   Using  both  default  and  non - default  roots  should 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  be  rare .   Moving  subsystems  back  and  forth  even  more  so . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  Just  warn  about  it  and  continue . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( cgrp_dfl_root_visible )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											pr_warn ( " failed to create files (%d) while rebinding 0x%x to default root \n " , 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-25 18:28:03 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												ret ,  ss_mask ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-25 18:28:03 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											pr_warn ( " you may retry by moving them to a different hierarchy and unbinding \n " ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:55 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-28 17:07:30 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Nothing  can  fail  from  this  point  on .   Remove  files  for  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  removed  subsystems  and  rebind  each  subsystem . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for_each_subsys ( ss ,  ssid ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:55 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ss_mask  &  ( 1  < <  ssid ) ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											cgroup_clear_dir ( & ss - > root - > cgrp ,  1  < <  ssid ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-24 15:21:47 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for_each_subsys ( ss ,  ssid )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										struct  cgroup_root  * src_root ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										struct  cgroup_subsys_state  * css ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										struct  css_set  * cset ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-24 15:21:47 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ! ( ss_mask  &  ( 1  < <  ssid ) ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-24 15:21:47 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										src_root  =  ss - > root ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										css  =  cgroup_css ( & src_root - > cgrp ,  ss ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-24 15:21:47 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										WARN_ON ( ! css  | |  cgroup_css ( & dst_root - > cgrp ,  ss ) ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 11:01:55 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										RCU_INIT_POINTER ( src_root - > cgrp . subsys [ ssid ] ,  NULL ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										rcu_assign_pointer ( dst_root - > cgrp . subsys [ ssid ] ,  css ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										ss - > root  =  dst_root ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										css - > cgroup  =  & dst_root - > cgrp ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 11:01:55 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										down_write ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										hash_for_each ( css_set_table ,  i ,  cset ,  hlist ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											list_move_tail ( & cset - > e_cset_node [ ss - > id ] , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												       & dst_root - > cgrp . e_csets [ ss - > id ] ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										up_write ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										src_root - > subsys_mask  & =  ~ ( 1  < <  ssid ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										src_root - > cgrp . child_subsys_mask  & =  ~ ( 1  < <  ssid ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										/* default hierarchy doesn't enable controllers by default */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										dst_root - > subsys_mask  | =  1  < <  ssid ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( dst_root  ! =  & cgrp_dfl_root ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											dst_root - > cgrp . child_subsys_mask  | =  1  < <  ssid ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-24 15:21:47 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ss - > bind ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											ss - > bind ( css ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:55 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									kernfs_activate ( dst_root - > cgrp . kn ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_show_options ( struct  seq_file  * seq ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											       struct  kernfs_root  * kf_root ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_root  * root  =  cgroup_root_from_kf ( kf_root ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
									struct  cgroup_subsys  * ss ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:57 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  ssid ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:57 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for_each_subsys ( ss ,  ssid ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( root - > subsys_mask  &  ( 1  < <  ssid ) ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:57 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											seq_printf ( seq ,  " ,%s " ,  ss - > name ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce sane_behavior mount option
It's a sad fact that at this point various cgroup controllers are
carrying so many idiosyncrasies and pure insanities that it simply
isn't possible to reach any sort of sane consistent behavior while
maintaining staying fully compatible with what already has been
exposed to userland.
As we can't break exposed userland interface, transitioning to sane
behaviors can only be done in steps while maintaining backwards
compatibility.  This patch introduces a new mount option -
__DEVEL__sane_behavior - which disables crazy features and enforces
consistent behaviors in cgroup core proper and various controllers.
As exactly which behaviors it changes are still being determined, the
mount option, at this point, is useful only for development of the new
behaviors.  As such, the mount option is prefixed with __DEVEL__ and
generates a warning message when used.
Eventually, once we get to the point where all controller's behaviors
are consistent enough to implement unified hierarchy, the __DEVEL__
prefix will be dropped, and more importantly, unified-hierarchy will
enforce sane_behavior by default.  Maybe we'll able to completely drop
the crazy stuff after a while, maybe not, but we at least have a
strategy to move on to saner behaviors.
This patch introduces the mount option and changes the following
behaviors in cgroup core.
* Mount options "noprefix" and "clone_children" are disallowed.  Also,
  cgroupfs file cgroup.clone_children is not created.
* When mounting an existing superblock, mount options should match.
  This is currently pretty crazy.  If one mounts a cgroup, creates a
  subdirectory, unmounts it and then mount it again with different
  option, it looks like the new options are applied but they aren't.
* Remount is disallowed.
The behaviors changes are documented in the comment above
CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
controllers are converted and planned improvements progress.
v2: Dropped unnecessary explicit file permission setting sane_behavior
    cftype entry as suggested by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
											 
										 
										
											2013-04-14 20:15:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( root - > flags  &  CGRP_ROOT_SANE_BEHAVIOR ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										seq_puts ( seq ,  " ,sane_behavior " ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-14 20:15:25 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( root - > flags  &  CGRP_ROOT_NOPREFIX ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
										seq_puts ( seq ,  " ,noprefix " ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-14 20:15:25 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( root - > flags  &  CGRP_ROOT_XATTR ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add xattr support
This is one of the items in the plumber's wish list.
For use cases:
>> What would the use case be for this?
>
> Attaching meta information to services, in an easily discoverable
> way. For example, in systemd we create one cgroup for each service, and
> could then store data like the main pid of the specific service as an
> xattr on the cgroup itself. That way we'd have almost all service state
> in the cgroupfs, which would make it possible to terminate systemd and
> later restart it without losing any state information. But there's more:
> for example, some very peculiar services cannot be terminated on
> shutdown (i.e. fakeraid DM stuff) and it would be really nice if the
> services in question could just mark that on their cgroup, by setting an
> xattr. On the more desktopy side of things there are other
> possibilities: for example there are plans defining what an application
> is along the lines of a cgroup (i.e. an app being a collection of
> processes). With xattrs one could then attach an icon or human readable
> program name on the cgroup.
>
> The key idea is that this would allow attaching runtime meta information
> to cgroups and everything they model (services, apps, vms), that doesn't
> need any complex userspace infrastructure, has good access control
> (i.e. because the file system enforces that anyway, and there's the
> "trusted." xattr namespace), notifications (inotify), and can easily be
> shared among applications.
>
> Lennart
v7:
- no changes
v6:
- remove user xattr namespace, only allow trusted and security
v5:
- check for capabilities before setting/removing xattrs
v4:
- no changes
v3:
- instead of config option, use mount option to enable xattr support
Original-patch-by: Li Zefan <lizefan@huawei.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lennart Poettering <lpoetter@redhat.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
											 
										 
										
											2012-08-23 16:53:30 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										seq_puts ( seq ,  " ,xattr " ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									spin_lock ( & release_agent_path_lock ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( strlen ( root - > release_agent_path ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										seq_printf ( seq ,  " ,release_agent=%s " ,  root - > release_agent_path ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									spin_unlock ( & release_agent_path_lock ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( test_bit ( CGRP_CPUSET_CLONE_CHILDREN ,  & root - > cgrp . flags ) ) 
							 
						 
					
						
							
								
									
										
										
										
											2010-10-27 15:33:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										seq_puts ( seq ,  " ,clone_children " ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:19 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( strlen ( root - > name ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										seq_printf ( seq ,  " ,name=%s " ,  root - > name ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								struct  cgroup_sb_opts  {  
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									unsigned  int  subsys_mask ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									unsigned  int  flags ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									char  * release_agent ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:38 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									bool  cpuset_clone_children ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:19 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									char  * name ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:23 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* User explicitly requested empty subsystem */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									bool  none ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								} ;  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2010-03-10 15:22:09 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  parse_cgroupfs_options ( char  * data ,  struct  cgroup_sb_opts  * opts )  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2010-10-27 15:33:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									char  * token ,  * o  =  data ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									bool  all_ss  =  false ,  one_ss  =  false ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									unsigned  int  mask  =  - 1U ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-25 11:53:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys  * ss ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									int  i ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-06-17 16:26:33 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# ifdef CONFIG_CPUSETS 
  
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mask  =  ~ ( 1U  < <  cpuset_cgrp_id ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-06-17 16:26:33 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# endif 
  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:19 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									memset ( opts ,  0 ,  sizeof ( * opts ) ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									while  ( ( token  =  strsep ( & o ,  " , " ) )  ! =  NULL )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ! * token ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											return  - EINVAL ; 
							 
						 
					
						
							
								
									
										
										
										
											2010-10-27 15:33:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ! strcmp ( token ,  " none " ) )  { 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:23 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											/* Explicitly have no subsystems */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											opts - > none  =  true ; 
							 
						 
					
						
							
								
									
										
										
										
											2010-10-27 15:33:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ! strcmp ( token ,  " all " ) )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											/* Mutually exclusive option 'all' + subsystem name */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( one_ss ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												return  - EINVAL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											all_ss  =  true ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce sane_behavior mount option
It's a sad fact that at this point various cgroup controllers are
carrying so many idiosyncrasies and pure insanities that it simply
isn't possible to reach any sort of sane consistent behavior while
maintaining staying fully compatible with what already has been
exposed to userland.
As we can't break exposed userland interface, transitioning to sane
behaviors can only be done in steps while maintaining backwards
compatibility.  This patch introduces a new mount option -
__DEVEL__sane_behavior - which disables crazy features and enforces
consistent behaviors in cgroup core proper and various controllers.
As exactly which behaviors it changes are still being determined, the
mount option, at this point, is useful only for development of the new
behaviors.  As such, the mount option is prefixed with __DEVEL__ and
generates a warning message when used.
Eventually, once we get to the point where all controller's behaviors
are consistent enough to implement unified hierarchy, the __DEVEL__
prefix will be dropped, and more importantly, unified-hierarchy will
enforce sane_behavior by default.  Maybe we'll able to completely drop
the crazy stuff after a while, maybe not, but we at least have a
strategy to move on to saner behaviors.
This patch introduces the mount option and changes the following
behaviors in cgroup core.
* Mount options "noprefix" and "clone_children" are disallowed.  Also,
  cgroupfs file cgroup.clone_children is not created.
* When mounting an existing superblock, mount options should match.
  This is currently pretty crazy.  If one mounts a cgroup, creates a
  subdirectory, unmounts it and then mount it again with different
  option, it looks like the new options are applied but they aren't.
* Remount is disallowed.
The behaviors changes are documented in the comment above
CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
controllers are converted and planned improvements progress.
v2: Dropped unnecessary explicit file permission setting sane_behavior
    cftype entry as suggested by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
											 
										 
										
											2013-04-14 20:15:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ! strcmp ( token ,  " __DEVEL__sane_behavior " ) )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											opts - > flags  | =  CGRP_ROOT_SANE_BEHAVIOR ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
										
										
											2010-10-27 15:33:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ! strcmp ( token ,  " noprefix " ) )  { 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-14 20:15:25 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											opts - > flags  | =  CGRP_ROOT_NOPREFIX ; 
							 
						 
					
						
							
								
									
										
										
										
											2010-10-27 15:33:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ! strcmp ( token ,  " clone_children " ) )  { 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:38 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											opts - > cpuset_clone_children  =  true ; 
							 
						 
					
						
							
								
									
										
										
										
											2010-10-27 15:33:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add xattr support
This is one of the items in the plumber's wish list.
For use cases:
>> What would the use case be for this?
>
> Attaching meta information to services, in an easily discoverable
> way. For example, in systemd we create one cgroup for each service, and
> could then store data like the main pid of the specific service as an
> xattr on the cgroup itself. That way we'd have almost all service state
> in the cgroupfs, which would make it possible to terminate systemd and
> later restart it without losing any state information. But there's more:
> for example, some very peculiar services cannot be terminated on
> shutdown (i.e. fakeraid DM stuff) and it would be really nice if the
> services in question could just mark that on their cgroup, by setting an
> xattr. On the more desktopy side of things there are other
> possibilities: for example there are plans defining what an application
> is along the lines of a cgroup (i.e. an app being a collection of
> processes). With xattrs one could then attach an icon or human readable
> program name on the cgroup.
>
> The key idea is that this would allow attaching runtime meta information
> to cgroups and everything they model (services, apps, vms), that doesn't
> need any complex userspace infrastructure, has good access control
> (i.e. because the file system enforces that anyway, and there's the
> "trusted." xattr namespace), notifications (inotify), and can easily be
> shared among applications.
>
> Lennart
v7:
- no changes
v6:
- remove user xattr namespace, only allow trusted and security
v5:
- check for capabilities before setting/removing xattrs
v4:
- no changes
v3:
- instead of config option, use mount option to enable xattr support
Original-patch-by: Li Zefan <lizefan@huawei.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lennart Poettering <lpoetter@redhat.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
											 
										 
										
											2012-08-23 16:53:30 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ! strcmp ( token ,  " xattr " ) )  { 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-14 20:15:25 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											opts - > flags  | =  CGRP_ROOT_XATTR ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add xattr support
This is one of the items in the plumber's wish list.
For use cases:
>> What would the use case be for this?
>
> Attaching meta information to services, in an easily discoverable
> way. For example, in systemd we create one cgroup for each service, and
> could then store data like the main pid of the specific service as an
> xattr on the cgroup itself. That way we'd have almost all service state
> in the cgroupfs, which would make it possible to terminate systemd and
> later restart it without losing any state information. But there's more:
> for example, some very peculiar services cannot be terminated on
> shutdown (i.e. fakeraid DM stuff) and it would be really nice if the
> services in question could just mark that on their cgroup, by setting an
> xattr. On the more desktopy side of things there are other
> possibilities: for example there are plans defining what an application
> is along the lines of a cgroup (i.e. an app being a collection of
> processes). With xattrs one could then attach an icon or human readable
> program name on the cgroup.
>
> The key idea is that this would allow attaching runtime meta information
> to cgroups and everything they model (services, apps, vms), that doesn't
> need any complex userspace infrastructure, has good access control
> (i.e. because the file system enforces that anyway, and there's the
> "trusted." xattr namespace), notifications (inotify), and can easily be
> shared among applications.
>
> Lennart
v7:
- no changes
v6:
- remove user xattr namespace, only allow trusted and security
v5:
- check for capabilities before setting/removing xattrs
v4:
- no changes
v3:
- instead of config option, use mount option to enable xattr support
Original-patch-by: Li Zefan <lizefan@huawei.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lennart Poettering <lpoetter@redhat.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
											 
										 
										
											2012-08-23 16:53:30 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
										
										
											2010-10-27 15:33:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ! strncmp ( token ,  " release_agent= " ,  14 ) )  { 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											/* Specifying two release agents is forbidden */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( opts - > release_agent ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												return  - EINVAL ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:19 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											opts - > release_agent  = 
							 
						 
					
						
							
								
									
										
										
										
											2010-08-10 18:02:54 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												kstrndup ( token  +  14 ,  PATH_MAX  -  1 ,  GFP_KERNEL ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											if  ( ! opts - > release_agent ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												return  - ENOMEM ; 
							 
						 
					
						
							
								
									
										
										
										
											2010-10-27 15:33:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ! strncmp ( token ,  " name= " ,  5 ) )  { 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:19 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											const  char  * name  =  token  +  5 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											/* Can't specify an empty name */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( ! strlen ( name ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												return  - EINVAL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											/* Must match [\w.-]+ */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											for  ( i  =  0 ;  i  <  strlen ( name ) ;  i + + )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												char  c  =  name [ i ] ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												if  ( isalnum ( c ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
													continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												if  ( ( c  = =  ' . ' )  | |  ( c  = =  ' - ' )  | |  ( c  = =  ' _ ' ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
													continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												return  - EINVAL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											/* Specifying two names is forbidden */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( opts - > name ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												return  - EINVAL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											opts - > name  =  kstrndup ( name , 
							 
						 
					
						
							
								
									
										
										
										
											2010-08-10 18:02:54 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
													      MAX_CGROUP_ROOT_NAMELEN  -  1 , 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:19 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
													      GFP_KERNEL ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( ! opts - > name ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												return  - ENOMEM ; 
							 
						 
					
						
							
								
									
										
										
										
											2010-10-27 15:33:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-25 11:53:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										for_each_subsys ( ss ,  i )  { 
							 
						 
					
						
							
								
									
										
										
										
											2010-10-27 15:33:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											if  ( strcmp ( token ,  ss - > name ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( ss - > disabled ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											/* Mutually exclusive option 'all' + subsystem name */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( all_ss ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												return  - EINVAL ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											opts - > subsys_mask  | =  ( 1  < <  i ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2010-10-27 15:33:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											one_ss  =  true ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											break ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( i  = =  CGROUP_SUBSYS_COUNT ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											return  - ENOENT ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:23 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* Consistency checks */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce sane_behavior mount option
It's a sad fact that at this point various cgroup controllers are
carrying so many idiosyncrasies and pure insanities that it simply
isn't possible to reach any sort of sane consistent behavior while
maintaining staying fully compatible with what already has been
exposed to userland.
As we can't break exposed userland interface, transitioning to sane
behaviors can only be done in steps while maintaining backwards
compatibility.  This patch introduces a new mount option -
__DEVEL__sane_behavior - which disables crazy features and enforces
consistent behaviors in cgroup core proper and various controllers.
As exactly which behaviors it changes are still being determined, the
mount option, at this point, is useful only for development of the new
behaviors.  As such, the mount option is prefixed with __DEVEL__ and
generates a warning message when used.
Eventually, once we get to the point where all controller's behaviors
are consistent enough to implement unified hierarchy, the __DEVEL__
prefix will be dropped, and more importantly, unified-hierarchy will
enforce sane_behavior by default.  Maybe we'll able to completely drop
the crazy stuff after a while, maybe not, but we at least have a
strategy to move on to saner behaviors.
This patch introduces the mount option and changes the following
behaviors in cgroup core.
* Mount options "noprefix" and "clone_children" are disallowed.  Also,
  cgroupfs file cgroup.clone_children is not created.
* When mounting an existing superblock, mount options should match.
  This is currently pretty crazy.  If one mounts a cgroup, creates a
  subdirectory, unmounts it and then mount it again with different
  option, it looks like the new options are applied but they aren't.
* Remount is disallowed.
The behaviors changes are documented in the comment above
CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
controllers are converted and planned improvements progress.
v2: Dropped unnecessary explicit file permission setting sane_behavior
    cftype entry as suggested by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
											 
										 
										
											2013-04-14 20:15:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( opts - > flags  &  CGRP_ROOT_SANE_BEHAVIOR )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-25 18:28:03 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										pr_warn ( " sane_behavior: this is still under development and its behaviors will change, proceed at your own risk \n " ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce sane_behavior mount option
It's a sad fact that at this point various cgroup controllers are
carrying so many idiosyncrasies and pure insanities that it simply
isn't possible to reach any sort of sane consistent behavior while
maintaining staying fully compatible with what already has been
exposed to userland.
As we can't break exposed userland interface, transitioning to sane
behaviors can only be done in steps while maintaining backwards
compatibility.  This patch introduces a new mount option -
__DEVEL__sane_behavior - which disables crazy features and enforces
consistent behaviors in cgroup core proper and various controllers.
As exactly which behaviors it changes are still being determined, the
mount option, at this point, is useful only for development of the new
behaviors.  As such, the mount option is prefixed with __DEVEL__ and
generates a warning message when used.
Eventually, once we get to the point where all controller's behaviors
are consistent enough to implement unified hierarchy, the __DEVEL__
prefix will be dropped, and more importantly, unified-hierarchy will
enforce sane_behavior by default.  Maybe we'll able to completely drop
the crazy stuff after a while, maybe not, but we at least have a
strategy to move on to saner behaviors.
This patch introduces the mount option and changes the following
behaviors in cgroup core.
* Mount options "noprefix" and "clone_children" are disallowed.  Also,
  cgroupfs file cgroup.clone_children is not created.
* When mounting an existing superblock, mount options should match.
  This is currently pretty crazy.  If one mounts a cgroup, creates a
  subdirectory, unmounts it and then mount it again with different
  option, it looks like the new options are applied but they aren't.
* Remount is disallowed.
The behaviors changes are documented in the comment above
CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
controllers are converted and planned improvements progress.
v2: Dropped unnecessary explicit file permission setting sane_behavior
    cftype entry as suggested by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
											 
										 
										
											2013-04-14 20:15:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:38 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ( opts - > flags  &  ( CGRP_ROOT_NOPREFIX  |  CGRP_ROOT_XATTR ) )  | | 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										    opts - > cpuset_clone_children  | |  opts - > release_agent  | | 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										    opts - > name )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-25 18:28:03 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											pr_err ( " sane_behavior: noprefix, xattr, clone_children, release_agent and name are not allowed \n " ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce sane_behavior mount option
It's a sad fact that at this point various cgroup controllers are
carrying so many idiosyncrasies and pure insanities that it simply
isn't possible to reach any sort of sane consistent behavior while
maintaining staying fully compatible with what already has been
exposed to userland.
As we can't break exposed userland interface, transitioning to sane
behaviors can only be done in steps while maintaining backwards
compatibility.  This patch introduces a new mount option -
__DEVEL__sane_behavior - which disables crazy features and enforces
consistent behaviors in cgroup core proper and various controllers.
As exactly which behaviors it changes are still being determined, the
mount option, at this point, is useful only for development of the new
behaviors.  As such, the mount option is prefixed with __DEVEL__ and
generates a warning message when used.
Eventually, once we get to the point where all controller's behaviors
are consistent enough to implement unified hierarchy, the __DEVEL__
prefix will be dropped, and more importantly, unified-hierarchy will
enforce sane_behavior by default.  Maybe we'll able to completely drop
the crazy stuff after a while, maybe not, but we at least have a
strategy to move on to saner behaviors.
This patch introduces the mount option and changes the following
behaviors in cgroup core.
* Mount options "noprefix" and "clone_children" are disallowed.  Also,
  cgroupfs file cgroup.clone_children is not created.
* When mounting an existing superblock, mount options should match.
  This is currently pretty crazy.  If one mounts a cgroup, creates a
  subdirectory, unmounts it and then mount it again with different
  option, it looks like the new options are applied but they aren't.
* Remount is disallowed.
The behaviors changes are documented in the comment above
CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
controllers are converted and planned improvements progress.
v2: Dropped unnecessary explicit file permission setting sane_behavior
    cftype entry as suggested by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
											 
										 
										
											2013-04-14 20:15:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											return  - EINVAL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:55 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									}  else  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  If  the  ' all '  option  was  specified  select  all  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  subsystems ,  otherwise  if  ' none ' ,  ' name = '  and  a  subsystem 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  name  options  were  not  specified ,  let ' s  default  to  ' all ' 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( all_ss  | |  ( ! one_ss  & &  ! opts - > none  & &  ! opts - > name ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											for_each_subsys ( ss ,  i ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												if  ( ! ss - > disabled ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
													opts - > subsys_mask  | =  ( 1  < <  i ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce sane_behavior mount option
It's a sad fact that at this point various cgroup controllers are
carrying so many idiosyncrasies and pure insanities that it simply
isn't possible to reach any sort of sane consistent behavior while
maintaining staying fully compatible with what already has been
exposed to userland.
As we can't break exposed userland interface, transitioning to sane
behaviors can only be done in steps while maintaining backwards
compatibility.  This patch introduces a new mount option -
__DEVEL__sane_behavior - which disables crazy features and enforces
consistent behaviors in cgroup core proper and various controllers.
As exactly which behaviors it changes are still being determined, the
mount option, at this point, is useful only for development of the new
behaviors.  As such, the mount option is prefixed with __DEVEL__ and
generates a warning message when used.
Eventually, once we get to the point where all controller's behaviors
are consistent enough to implement unified hierarchy, the __DEVEL__
prefix will be dropped, and more importantly, unified-hierarchy will
enforce sane_behavior by default.  Maybe we'll able to completely drop
the crazy stuff after a while, maybe not, but we at least have a
strategy to move on to saner behaviors.
This patch introduces the mount option and changes the following
behaviors in cgroup core.
* Mount options "noprefix" and "clone_children" are disallowed.  Also,
  cgroupfs file cgroup.clone_children is not created.
* When mounting an existing superblock, mount options should match.
  This is currently pretty crazy.  If one mounts a cgroup, creates a
  subdirectory, unmounts it and then mount it again with different
  option, it looks like the new options are applied but they aren't.
* Remount is disallowed.
The behaviors changes are documented in the comment above
CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
controllers are converted and planned improvements progress.
v2: Dropped unnecessary explicit file permission setting sane_behavior
    cftype entry as suggested by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
											 
										 
										
											2013-04-14 20:15:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:55 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  We  either  have  to  specify  by  name  or  by  subsystems .  ( So 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  all  empty  hierarchies  must  have  a  name ) . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ! opts - > subsys_mask  & &  ! opts - > name ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce sane_behavior mount option
It's a sad fact that at this point various cgroup controllers are
carrying so many idiosyncrasies and pure insanities that it simply
isn't possible to reach any sort of sane consistent behavior while
maintaining staying fully compatible with what already has been
exposed to userland.
As we can't break exposed userland interface, transitioning to sane
behaviors can only be done in steps while maintaining backwards
compatibility.  This patch introduces a new mount option -
__DEVEL__sane_behavior - which disables crazy features and enforces
consistent behaviors in cgroup core proper and various controllers.
As exactly which behaviors it changes are still being determined, the
mount option, at this point, is useful only for development of the new
behaviors.  As such, the mount option is prefixed with __DEVEL__ and
generates a warning message when used.
Eventually, once we get to the point where all controller's behaviors
are consistent enough to implement unified hierarchy, the __DEVEL__
prefix will be dropped, and more importantly, unified-hierarchy will
enforce sane_behavior by default.  Maybe we'll able to completely drop
the crazy stuff after a while, maybe not, but we at least have a
strategy to move on to saner behaviors.
This patch introduces the mount option and changes the following
behaviors in cgroup core.
* Mount options "noprefix" and "clone_children" are disallowed.  Also,
  cgroupfs file cgroup.clone_children is not created.
* When mounting an existing superblock, mount options should match.
  This is currently pretty crazy.  If one mounts a cgroup, creates a
  subdirectory, unmounts it and then mount it again with different
  option, it looks like the new options are applied but they aren't.
* Remount is disallowed.
The behaviors changes are documented in the comment above
CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
controllers are converted and planned improvements progress.
v2: Dropped unnecessary explicit file permission setting sane_behavior
    cftype entry as suggested by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
											 
										 
										
											2013-04-14 20:15:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											return  - EINVAL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-06-17 16:26:33 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Option  noprefix  was  introduced  just  for  backward  compatibility 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  with  the  old  cpuset ,  so  we  allow  noprefix  only  if  mounting  just 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  the  cpuset  subsystem . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-14 20:15:25 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ( opts - > flags  &  CGRP_ROOT_NOPREFIX )  & &  ( opts - > subsys_mask  &  mask ) ) 
							 
						 
					
						
							
								
									
										
										
										
											2009-06-17 16:26:33 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  - EINVAL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:23 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* Can't specify "none" and some subsystems */ 
							 
						 
					
						
							
								
									
										
										
										
											2012-08-23 16:53:31 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( opts - > subsys_mask  & &  opts - > none ) 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:23 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  - EINVAL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_remount ( struct  kernfs_root  * kf_root ,  int  * flags ,  char  * data )  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									int  ret  =  0 ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_root  * root  =  cgroup_root_from_kf ( kf_root ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
									struct  cgroup_sb_opts  opts ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									unsigned  int  added_mask ,  removed_mask ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce sane_behavior mount option
It's a sad fact that at this point various cgroup controllers are
carrying so many idiosyncrasies and pure insanities that it simply
isn't possible to reach any sort of sane consistent behavior while
maintaining staying fully compatible with what already has been
exposed to userland.
As we can't break exposed userland interface, transitioning to sane
behaviors can only be done in steps while maintaining backwards
compatibility.  This patch introduces a new mount option -
__DEVEL__sane_behavior - which disables crazy features and enforces
consistent behaviors in cgroup core proper and various controllers.
As exactly which behaviors it changes are still being determined, the
mount option, at this point, is useful only for development of the new
behaviors.  As such, the mount option is prefixed with __DEVEL__ and
generates a warning message when used.
Eventually, once we get to the point where all controller's behaviors
are consistent enough to implement unified hierarchy, the __DEVEL__
prefix will be dropped, and more importantly, unified-hierarchy will
enforce sane_behavior by default.  Maybe we'll able to completely drop
the crazy stuff after a while, maybe not, but we at least have a
strategy to move on to saner behaviors.
This patch introduces the mount option and changes the following
behaviors in cgroup core.
* Mount options "noprefix" and "clone_children" are disallowed.  Also,
  cgroupfs file cgroup.clone_children is not created.
* When mounting an existing superblock, mount options should match.
  This is currently pretty crazy.  If one mounts a cgroup, creates a
  subdirectory, unmounts it and then mount it again with different
  option, it looks like the new options are applied but they aren't.
* Remount is disallowed.
The behaviors changes are documented in the comment above
CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
controllers are converted and planned improvements progress.
v2: Dropped unnecessary explicit file permission setting sane_behavior
    cftype entry as suggested by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
											 
										 
										
											2013-04-14 20:15:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( root - > flags  &  CGRP_ROOT_SANE_BEHAVIOR )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-25 18:28:03 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										pr_err ( " sane_behavior: remount is not allowed \n " ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce sane_behavior mount option
It's a sad fact that at this point various cgroup controllers are
carrying so many idiosyncrasies and pure insanities that it simply
isn't possible to reach any sort of sane consistent behavior while
maintaining staying fully compatible with what already has been
exposed to userland.
As we can't break exposed userland interface, transitioning to sane
behaviors can only be done in steps while maintaining backwards
compatibility.  This patch introduces a new mount option -
__DEVEL__sane_behavior - which disables crazy features and enforces
consistent behaviors in cgroup core proper and various controllers.
As exactly which behaviors it changes are still being determined, the
mount option, at this point, is useful only for development of the new
behaviors.  As such, the mount option is prefixed with __DEVEL__ and
generates a warning message when used.
Eventually, once we get to the point where all controller's behaviors
are consistent enough to implement unified hierarchy, the __DEVEL__
prefix will be dropped, and more importantly, unified-hierarchy will
enforce sane_behavior by default.  Maybe we'll able to completely drop
the crazy stuff after a while, maybe not, but we at least have a
strategy to move on to saner behaviors.
This patch introduces the mount option and changes the following
behaviors in cgroup core.
* Mount options "noprefix" and "clone_children" are disallowed.  Also,
  cgroupfs file cgroup.clone_children is not created.
* When mounting an existing superblock, mount options should match.
  This is currently pretty crazy.  If one mounts a cgroup, creates a
  subdirectory, unmounts it and then mount it again with different
  option, it looks like the new options are applied but they aren't.
* Remount is disallowed.
The behaviors changes are documented in the comment above
CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
controllers are converted and planned improvements progress.
v2: Dropped unnecessary explicit file permission setting sane_behavior
    cftype entry as suggested by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
											 
										 
										
											2013-04-14 20:15:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  - EINVAL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
									mutex_lock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* See what subsystems are wanted */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									ret  =  parse_cgroupfs_options ( data ,  & opts ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ret ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										goto  out_unlock ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( opts . subsys_mask  ! =  root - > subsys_mask  | |  opts . release_agent ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-25 18:28:03 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										pr_warn ( " option changes via remount are deprecated (pid=%d comm=%s) \n " , 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-25 18:28:03 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											task_tgid_nr ( current ) ,  current - > comm ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:54 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									added_mask  =  opts . subsys_mask  &  ~ root - > subsys_mask ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									removed_mask  =  root - > subsys_mask  &  ~ opts . subsys_mask ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-08-23 16:53:29 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2010-03-10 15:22:09 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* Don't allow flags or name to change at remount */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-27 19:37:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ( ( opts . flags  ^  root - > flags )  &  CGRP_ROOT_OPTION_MASK )  | | 
							 
						 
					
						
							
								
									
										
										
										
											2010-03-10 15:22:09 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									    ( opts . name  & &  strcmp ( opts . name ,  root - > name ) ) )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										pr_err ( " option or name mismatch, new: 0x%x  \" %s \" , old: 0x%x  \" %s \" \n " , 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-27 19:37:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										       opts . flags  &  CGRP_ROOT_OPTION_MASK ,  opts . name  ? :  " " , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										       root - > flags  &  CGRP_ROOT_OPTION_MASK ,  root - > name ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:19 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										ret  =  - EINVAL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										goto  out_unlock ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-28 17:07:30 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* remounting is not allowed for populated hierarchies */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ! list_empty ( & root - > cgrp . self . children ) )  { 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-28 17:07:30 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										ret  =  - EBUSY ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroups: don't change release_agent when remount failed
Remount can fail in either case:
  - wrong mount options is specified, or option 'noprefix' is changed.
  - a to-be-added subsys is already mounted/active.
When using remount to change 'release_agent', for the above former failure
case, remount will return errno with release_agent unchanged, but for the
latter case, remount will return EBUSY with relase_agent changed, which is
unexpected I think:
 # mount -t cgroup -o cpu xxx /cgrp1
 # mount -t cgroup -o cpuset,release_agent=agent1 yyy /cgrp2
 # cat /cgrp2/release_agent
 agent1
 # mount -t cgroup -o remount,cpuset,noprefix,release_agent=agent2 yyy /cgrp2
 mount: /cgrp2 not mounted already, or bad option
 # cat /cgrp2/release_agent
 agent1     <-- ok
 # mount -t cgroup -o remount,cpu,cpuset,release_agent=agent2 yyy /cgrp2
 mount: /cgrp2 is busy
 # cat /cgrp2/release_agent
 agent2     <-- unexpected!
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2009-04-02 16:57:30 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										goto  out_unlock ; 
							 
						 
					
						
							
								
									
										
										
										
											2010-03-10 15:22:09 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									ret  =  rebind_subsystems ( root ,  added_mask ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-28 17:07:30 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ret ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroups: don't change release_agent when remount failed
Remount can fail in either case:
  - wrong mount options is specified, or option 'noprefix' is changed.
  - a to-be-added subsys is already mounted/active.
When using remount to change 'release_agent', for the above former failure
case, remount will return errno with release_agent unchanged, but for the
latter case, remount will return EBUSY with relase_agent changed, which is
unexpected I think:
 # mount -t cgroup -o cpu xxx /cgrp1
 # mount -t cgroup -o cpuset,release_agent=agent1 yyy /cgrp2
 # cat /cgrp2/release_agent
 agent1
 # mount -t cgroup -o remount,cpuset,noprefix,release_agent=agent2 yyy /cgrp2
 mount: /cgrp2 not mounted already, or bad option
 # cat /cgrp2/release_agent
 agent1     <-- ok
 # mount -t cgroup -o remount,cpu,cpuset,release_agent=agent2 yyy /cgrp2
 mount: /cgrp2 is busy
 # cat /cgrp2/release_agent
 agent2     <-- unexpected!
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2009-04-02 16:57:30 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										goto  out_unlock ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									rebind_subsystems ( & cgrp_dfl_root ,  removed_mask ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( opts . release_agent )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										spin_lock ( & release_agent_path_lock ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										strcpy ( root - > release_agent_path ,  opts . release_agent ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										spin_unlock ( & release_agent_path_lock ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								 out_unlock : 
							 
						 
					
						
							
								
									
										
										
										
											2009-04-02 16:57:27 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									kfree ( opts . release_agent ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:19 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									kfree ( opts . name ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
									mutex_unlock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  ret ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:39 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  To  reduce  the  fork ( )  overhead  for  systems  that  are  not  actually  using 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  their  cgroups  capability ,  we  don ' t  maintain  the  lists  running  through 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  each  css_set  to  its  tasks  until  we  see  the  list  actually  used  -  in  other 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  words  after  the  first  mount . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  bool  use_task_css_set_links  __read_mostly ;  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  void  cgroup_enable_task_cg_lists ( void )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  task_struct  * p ,  * g ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									down_write ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:39 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( use_task_css_set_links ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										goto  out_unlock ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									use_task_css_set_links  =  true ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  We  need  tasklist_lock  because  RCU  is  not  safe  against 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  while_each_thread ( ) .  Besides ,  a  forking  task  that  has  passed 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  cgroup_post_fork ( )  without  seeing  use_task_css_set_links  =  1 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  is  not  guaranteed  to  have  its  child  immediately  visible  in  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  tasklist  if  we  walk  through  it  with  RCU . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									read_lock ( & tasklist_lock ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									do_each_thread ( g ,  p )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										WARN_ON_ONCE ( ! list_empty ( & p - > cg_list )  | | 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											     task_css_set ( p )  ! =  & init_css_set ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  We  should  check  if  the  process  is  exiting ,  otherwise 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  it  will  race  with  cgroup_exit ( )  in  that  the  list 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  entry  won ' t  be  deleted  though  the  process  has  exited . 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 09:56:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										 *  Do  it  while  holding  siglock  so  that  we  don ' t  end  up 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  racing  against  cgroup_exit ( ) . 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:39 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 09:56:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										spin_lock_irq ( & p - > sighand - > siglock ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ! ( p - > flags  &  PF_EXITING ) )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											struct  css_set  * cset  =  task_css_set ( p ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											list_add ( & p - > cg_list ,  & cset - > tasks ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											get_css_set ( cset ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 09:56:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										spin_unlock_irq ( & p - > sighand - > siglock ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:39 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									}  while_each_thread ( g ,  p ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									read_unlock ( & tasklist_lock ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								out_unlock :  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									up_write ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:39 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2008-10-18 20:28:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  init_cgroup_housekeeping ( struct  cgroup  * cgrp )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys  * ss ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									int  ssid ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									INIT_LIST_HEAD ( & cgrp - > self . sibling ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									INIT_LIST_HEAD ( & cgrp - > self . children ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									INIT_LIST_HEAD ( & cgrp - > cset_links ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-10-18 20:28:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									INIT_LIST_HEAD ( & cgrp - > release_list ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:27 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									INIT_LIST_HEAD ( & cgrp - > pidlists ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									mutex_init ( & cgrp - > pidlist_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:00 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgrp - > self . cgroup  =  cgrp ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgrp - > self . flags  | =  CSS_ONLINE ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									for_each_subsys ( ss ,  ssid ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										INIT_LIST_HEAD ( & cgrp - > e_csets [ ssid ] ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									init_waitqueue_head ( & cgrp - > offline_waitq ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-10-18 20:28:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:19 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  init_cgroup_root ( struct  cgroup_root  * root ,  
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:53 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											     struct  cgroup_sb_opts  * opts ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * cgrp  =  & root - > cgrp ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:54 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
									INIT_LIST_HEAD ( & root - > root_list ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									atomic_set ( & root - > nr_cgrps ,  1 ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:40:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgrp - > root  =  root ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-10-18 20:28:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									init_cgroup_housekeeping ( cgrp ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-07-31 09:50:50 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									idr_init ( & root - > cgroup_idr ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:19 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									root - > flags  =  opts - > flags ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( opts - > release_agent ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										strcpy ( root - > release_agent_path ,  opts - > release_agent ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( opts - > name ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										strcpy ( root - > name ,  opts - > name ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:38 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( opts - > cpuset_clone_children ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										set_bit ( CGRP_CPUSET_CLONE_CHILDREN ,  & root - > cgrp . flags ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:19 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_setup_root ( struct  cgroup_root  * root ,  unsigned  int  ss_mask )  
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:23 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									LIST_HEAD ( tmp_links ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * root_cgrp  =  & root - > cgrp ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  css_set  * cset ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									int  i ,  ret ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:23 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:19 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									ret  =  cgroup_idr_alloc ( & root - > cgroup_idr ,  root_cgrp ,  1 ,  2 ,  GFP_NOWAIT ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ret  <  0 ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										goto  out ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									root_cgrp - > id  =  ret ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:19 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									ret  =  percpu_ref_init ( & root_cgrp - > self . refcnt ,  css_release ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ret ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										goto  out ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  We ' re  accessing  css_set_count  without  locking  css_set_rwsem  here , 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  but  that ' s  OK  -  it  can  only  be  increased  by  someone  holding 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  cgroup_lock ,  and  that ' s  us .  The  worst  that  can  happen  is  that  we 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  have  some  link  structures  left  over 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									ret  =  allocate_cgrp_cset_links ( css_set_count ,  & tmp_links ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ret ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										goto  cancel_ref ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:53 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									ret  =  cgroup_init_root_id ( root ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
									if  ( ret ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										goto  cancel_ref ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									root - > kf_root  =  kernfs_create_root ( & cgroup_kf_syscall_ops , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
													   KERNFS_ROOT_CREATE_DEACTIVATED , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
													   root_cgrp ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( IS_ERR ( root - > kf_root ) )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										ret  =  PTR_ERR ( root - > kf_root ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										goto  exit_root_id ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									root_cgrp - > kn  =  root - > kf_root - > kn ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									ret  =  cgroup_addrm_files ( root_cgrp ,  cgroup_base_files ,  true ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ret ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										goto  destroy_root ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									ret  =  rebind_subsystems ( root ,  ss_mask ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ret ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										goto  destroy_root ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  There  must  be  no  failure  case  after  here ,  since  rebinding  takes 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  care  of  subsystems '  refcounts ,  which  are  explicitly  dropped  in 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  the  failure  exit  path . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									list_add ( & root - > root_list ,  & cgroup_roots ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									cgroup_root_count + + ; 
							 
						 
					
						
							
								
									
										
										
										
											2010-12-21 13:29:29 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  Link  the  root  cgroup  in  this  hierarchy  into  all  the  css_set 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  objects . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									down_write ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									hash_for_each ( css_set_table ,  i ,  cset ,  hlist ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										link_css_set ( & tmp_links ,  cset ,  root_cgrp ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									up_write ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									BUG_ON ( ! list_empty ( & root_cgrp - > self . children ) ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									BUG_ON ( atomic_read ( & root - > nr_cgrps )  ! =  1 ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									kernfs_activate ( root_cgrp - > kn ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									ret  =  0 ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									goto  out ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								destroy_root :  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									kernfs_destroy_root ( root - > kf_root ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									root - > kf_root  =  NULL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								exit_root_id :  
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgroup_exit_root_id ( root ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								cancel_ref :  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									percpu_ref_cancel_init ( & root_cgrp - > self . refcnt ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								out :  
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									free_cgrp_cset_links ( & tmp_links ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  ret ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2010-07-26 13:23:11 +04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  struct  dentry  * cgroup_mount ( struct  file_system_type  * fs_type ,  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
											 int  flags ,  const  char  * unused_dev_name , 
							 
						 
					
						
							
								
									
										
										
										
											2010-07-26 13:23:11 +04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											 void  * data ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_root  * root ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
									struct  cgroup_sb_opts  opts ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  dentry  * dentry ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  ret ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-04 17:14:41 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									bool  new_sb ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:38 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  The  first  time  anyone  tries  to  mount  a  cgroup ,  enable  the  list 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  linking  each  css_set  to  its  tasks  and  fix  up  all  existing  tasks . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ! use_task_css_set_links ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgroup_enable_task_cg_lists ( ) ; 
							 
						 
					
						
							
								
									
										
										
											
												cgroup: fix the retry path of cgroup_mount()
If we hit the retry path, we'll call parse_cgroupfs_options() again,
but the string we pass to it has been modified by the previous call
to this function.
This bug can be observed by:
  # mount -t cgroup -o name=foo,cpuset xxx /mnt && umount /mnt && \
    mount -t cgroup -o name=foo,cpuset xxx /mnt
  mount: wrong fs type, bad option, bad superblock on xxx,
         missing codepage or helper program, or other error
  ...
The second mount passed "name=foo,cpuset" to the parser, and then it
hit the retry path and call the parser again, but this time the string
passed to the parser is "name=foo".
To fix this, we avoid calling parse_cgroupfs_options() again in this
case.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
											 
										 
										
											2014-04-17 13:53:08 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroups: revamp subsys array
This patch series provides the ability for cgroup subsystems to be
compiled as modules both within and outside the kernel tree.  This is
mainly useful for classifiers and subsystems that hook into components
that are already modules.  cls_cgroup and blkio-cgroup serve as the
example use cases for this feature.
It provides an interface cgroup_load_subsys() and cgroup_unload_subsys()
which modular subsystems can use to register and depart during runtime.
The net_cls classifier subsystem serves as the example for a subsystem
which can be converted into a module using these changes.
Patch #1 sets up the subsys[] array so its contents can be dynamic as
modules appear and (eventually) disappear.  Iterations over the array are
modified to handle when subsystems are absent, and the dynamic section of
the array is protected by cgroup_mutex.
Patch #2 implements an interface for modules to load subsystems, called
cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module
pointer in struct cgroup_subsys.
Patch #3 adds a mechanism for unloading modular subsystems, which includes
a more advanced rework of the rudimentary reference counting introduced in
patch 2.
Patch #4 modifies the net_cls subsystem, which already had some module
declarations, to be configurable as a module, which also serves as a
simple proof-of-concept.
Part of implementing patches 2 and 4 involved updating css pointers in
each css_set when the module appears or leaves.  In doing this, it was
discovered that css_sets always remain linked to the dummy cgroup,
regardless of whether or not any subsystems are actually bound to it
(i.e., not mounted on an actual hierarchy).  The subsystem loading and
unloading code therefore should keep in mind the special cases where the
added subsystem is the only one in the dummy cgroup (and therefore all
css_sets need to be linked back into it) and where the removed subsys was
the only one in the dummy cgroup (and therefore all css_sets should be
unlinked from it) - however, as all css_sets always stay attached to the
dummy cgroup anyway, these cases are ignored.  Any fix that addresses this
issue should also make sure these cases are addressed in the subsystem
loading and unloading code.
This patch:
Make subsys[] able to be dynamically populated to support modular
subsystems
This patch reworks the way the subsys[] array is used so that subsystems
can register themselves after boot time, and enables the internals of
cgroups to be able to handle when subsystems are not present or may
appear/disappear.
Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2010-03-10 15:22:07 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_lock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* First find the desired set of subsystems */ 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
									ret  =  parse_cgroupfs_options ( data ,  & opts ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:19 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ret ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										goto  out_unlock ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:00 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* look for a matching existing root */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:55 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ! opts . subsys_mask  & &  ! opts . none  & &  ! opts . name )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgrp_dfl_root_visible  =  true ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										root  =  & cgrp_dfl_root ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgroup_get ( & root - > cgrp ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										ret  =  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										goto  out_unlock ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:53 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for_each_root ( root )  { 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										bool  name_match  =  false ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-28 17:07:30 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( root  = =  & cgrp_dfl_root ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:53 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-28 17:07:30 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2010-03-10 15:22:09 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										/*
 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										 *  If  we  asked  for  a  name  then  it  must  match .   Also ,  if 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  name  matches  but  sybsys_mask  doesn ' t ,  we  should  fail . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  Remember  whether  name  matched . 
							 
						 
					
						
							
								
									
										
										
										
											2010-03-10 15:22:09 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										 */ 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( opts . name )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( strcmp ( opts . name ,  root - > name ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											name_match  =  true ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:19 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										/*
 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										 *  If  we  asked  for  subsystems  ( or  explicitly  for  no 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  subsystems )  then  they  must  match . 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:19 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										 */ 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ( opts . subsys_mask  | |  opts . none )  & & 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										    ( opts . subsys_mask  ! =  root - > subsys_mask ) )  { 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											if  ( ! name_match ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											ret  =  - EBUSY ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											goto  out_unlock ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce sane_behavior mount option
It's a sad fact that at this point various cgroup controllers are
carrying so many idiosyncrasies and pure insanities that it simply
isn't possible to reach any sort of sane consistent behavior while
maintaining staying fully compatible with what already has been
exposed to userland.
As we can't break exposed userland interface, transitioning to sane
behaviors can only be done in steps while maintaining backwards
compatibility.  This patch introduces a new mount option -
__DEVEL__sane_behavior - which disables crazy features and enforces
consistent behaviors in cgroup core proper and various controllers.
As exactly which behaviors it changes are still being determined, the
mount option, at this point, is useful only for development of the new
behaviors.  As such, the mount option is prefixed with __DEVEL__ and
generates a warning message when used.
Eventually, once we get to the point where all controller's behaviors
are consistent enough to implement unified hierarchy, the __DEVEL__
prefix will be dropped, and more importantly, unified-hierarchy will
enforce sane_behavior by default.  Maybe we'll able to completely drop
the crazy stuff after a while, maybe not, but we at least have a
strategy to move on to saner behaviors.
This patch introduces the mount option and changes the following
behaviors in cgroup core.
* Mount options "noprefix" and "clone_children" are disallowed.  Also,
  cgroupfs file cgroup.clone_children is not created.
* When mounting an existing superblock, mount options should match.
  This is currently pretty crazy.  If one mounts a cgroup, creates a
  subdirectory, unmounts it and then mount it again with different
  option, it looks like the new options are applied but they aren't.
* Remount is disallowed.
The behaviors changes are documented in the comment above
CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
controllers are converted and planned improvements progress.
v2: Dropped unnecessary explicit file permission setting sane_behavior
    cftype entry as suggested by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
											 
										 
										
											2013-04-14 20:15:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-29 14:06:10 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ( root - > flags  ^  opts . flags )  &  CGRP_ROOT_OPTION_MASK )  { 
							 
						 
					
						
							
								
									
										
										
										
											2013-05-26 21:33:09 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											if  ( ( root - > flags  |  opts . flags )  &  CGRP_ROOT_SANE_BEHAVIOR )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-25 18:28:03 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												pr_err ( " sane_behavior: new mount options should match the existing superblock \n " ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-05-26 21:33:09 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												ret  =  - EINVAL ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												goto  out_unlock ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-05-26 21:33:09 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											}  else  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-25 18:28:03 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												pr_warn ( " new mount options do not match the existing superblock, will be ignored \n " ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-05-26 21:33:09 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											} 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce sane_behavior mount option
It's a sad fact that at this point various cgroup controllers are
carrying so many idiosyncrasies and pure insanities that it simply
isn't possible to reach any sort of sane consistent behavior while
maintaining staying fully compatible with what already has been
exposed to userland.
As we can't break exposed userland interface, transitioning to sane
behaviors can only be done in steps while maintaining backwards
compatibility.  This patch introduces a new mount option -
__DEVEL__sane_behavior - which disables crazy features and enforces
consistent behaviors in cgroup core proper and various controllers.
As exactly which behaviors it changes are still being determined, the
mount option, at this point, is useful only for development of the new
behaviors.  As such, the mount option is prefixed with __DEVEL__ and
generates a warning message when used.
Eventually, once we get to the point where all controller's behaviors
are consistent enough to implement unified hierarchy, the __DEVEL__
prefix will be dropped, and more importantly, unified-hierarchy will
enforce sane_behavior by default.  Maybe we'll able to completely drop
the crazy stuff after a while, maybe not, but we at least have a
strategy to move on to saner behaviors.
This patch introduces the mount option and changes the following
behaviors in cgroup core.
* Mount options "noprefix" and "clone_children" are disallowed.  Also,
  cgroupfs file cgroup.clone_children is not created.
* When mounting an existing superblock, mount options should match.
  This is currently pretty crazy.  If one mounts a cgroup, creates a
  subdirectory, unmounts it and then mount it again with different
  option, it looks like the new options are applied but they aren't.
* Remount is disallowed.
The behaviors changes are documented in the comment above
CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
controllers are converted and planned improvements progress.
v2: Dropped unnecessary explicit file permission setting sane_behavior
    cftype entry as suggested by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
											 
										 
										
											2013-04-14 20:15:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										/*
 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										 *  A  root ' s  lifetime  is  governed  by  its  root  cgroup . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  tryget_live  failure  indicate  that  the  root  is  being 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  destroyed .   Wait  for  destruction  to  complete  so  that  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  subsystems  are  free .   We  can  use  wait_queue  for  the  wait 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  but  this  path  is  super  cold .   Let ' s  just  sleep  for  a  bit 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  and  retry . 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ! percpu_ref_tryget_live ( & root - > cgrp . self . refcnt ) )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											mutex_unlock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											msleep ( 10 ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:00 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											ret  =  restart_syscall ( ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											goto  out_free ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										ret  =  0 ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										goto  out_unlock ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:53 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  No  such  thing ,  create  a  new  one .   name =  matching  without  subsys 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  specification  is  allowed  for  already  existing  hierarchies  but  we 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  can ' t  create  new  one  without  subsys  specification . 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:53 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ! opts . subsys_mask  & &  ! opts . none )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										ret  =  - EINVAL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										goto  out_unlock ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:53 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									root  =  kzalloc ( sizeof ( * root ) ,  GFP_KERNEL ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ! root )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										ret  =  - ENOMEM ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										goto  out_unlock ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-01-29 14:25:22 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
										
										
											2009-01-07 18:07:41 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:53 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									init_cgroup_root ( root ,  & opts ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:38 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									ret  =  cgroup_setup_root ( root ,  opts . subsys_mask ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ret ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgroup_free_root ( root ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-14 11:36:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								out_unlock :  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
									mutex_unlock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:00 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								out_free :  
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:19 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									kfree ( opts . release_agent ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									kfree ( opts . name ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add xattr support
This is one of the items in the plumber's wish list.
For use cases:
>> What would the use case be for this?
>
> Attaching meta information to services, in an easily discoverable
> way. For example, in systemd we create one cgroup for each service, and
> could then store data like the main pid of the specific service as an
> xattr on the cgroup itself. That way we'd have almost all service state
> in the cgroupfs, which would make it possible to terminate systemd and
> later restart it without losing any state information. But there's more:
> for example, some very peculiar services cannot be terminated on
> shutdown (i.e. fakeraid DM stuff) and it would be really nice if the
> services in question could just mark that on their cgroup, by setting an
> xattr. On the more desktopy side of things there are other
> possibilities: for example there are plans defining what an application
> is along the lines of a cgroup (i.e. an app being a collection of
> processes). With xattrs one could then attach an icon or human readable
> program name on the cgroup.
>
> The key idea is that this would allow attaching runtime meta information
> to cgroups and everything they model (services, apps, vms), that doesn't
> need any complex userspace infrastructure, has good access control
> (i.e. because the file system enforces that anyway, and there's the
> "trusted." xattr namespace), notifications (inotify), and can easily be
> shared among applications.
>
> Lennart
v7:
- no changes
v6:
- remove user xattr namespace, only allow trusted and security
v5:
- check for capabilities before setting/removing xattrs
v4:
- no changes
v3:
- instead of config option, use mount option to enable xattr support
Original-patch-by: Li Zefan <lizefan@huawei.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lennart Poettering <lpoetter@redhat.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
											 
										 
										
											2012-08-23 16:53:30 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ret ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  ERR_PTR ( ret ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-04-26 15:40:28 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									dentry  =  kernfs_mount ( fs_type ,  flags ,  root - > kf_root , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												CGROUP_SUPER_MAGIC ,  & new_sb ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-04 17:14:41 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( IS_ERR ( dentry )  | |  ! new_sb ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										cgroup_put ( & root - > cgrp ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  dentry ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  void  cgroup_kill_sb ( struct  super_block  * sb )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  kernfs_root  * kf_root  =  kernfs_root_from_sb ( sb ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_root  * root  =  cgroup_root_from_kf ( kf_root ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  If  @ root  doesn ' t  have  any  mounts  or  children ,  start  killing  it . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  This  prevents  new  mounts  by  disabling  percpu_ref_tryget_live ( ) . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  cgroup_mount ( )  may  wait  for  @ root ' s  release . 
							 
						 
					
						
							
								
									
										
										
										
											2014-06-04 16:48:15 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  And  don ' t  kill  the  default  root . 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-06-04 16:48:15 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( css_has_online_children ( & root - > cgrp . self )  | | 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									    root  = =  & cgrp_dfl_root ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										cgroup_put ( & root - > cgrp ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									else 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										percpu_ref_kill ( & root - > cgrp . self . refcnt ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									kernfs_kill_sb ( sb ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  struct  file_system_type  cgroup_fs_type  =  {  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									. name  =  " cgroup " , 
							 
						 
					
						
							
								
									
										
										
										
											2010-07-26 13:23:11 +04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									. mount  =  cgroup_mount , 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
									. kill_sb  =  cgroup_kill_sb , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								} ;  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2010-08-05 13:53:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  struct  kobject  * cgroup_kobj ;  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-04-14 20:50:08 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
									
										
										
										
											2013-07-11 16:34:48 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  task_cgroup_path  -  cgroup  path  of  a  task  in  the  first  cgroup  hierarchy 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-14 20:50:08 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ task :  target  task 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ buf :  the  buffer  to  write  the  path  into 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ buflen :  the  length  of  the  buffer 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
										
										
											2013-07-11 16:34:48 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  Determine  @ task ' s  cgroup  on  the  first  ( the  one  with  the  lowest  non - zero 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  hierarchy_id )  cgroup  hierarchy  and  copy  its  path  into  @ buf .   This 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  function  grabs  cgroup_mutex  and  shouldn ' t  be  used  inside  locks  used  by 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup  controller  callbacks . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  Return  value  is  the  same  as  kernfs_path ( ) . 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-14 20:50:08 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								char  * task_cgroup_path ( struct  task_struct  * task ,  char  * buf ,  size_t  buflen )  
						 
					
						
							
								
									
										
										
										
											2013-04-14 20:50:08 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_root  * root ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-07-11 16:34:48 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * cgrp ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  hierarchy_id  =  1 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									char  * path  =  NULL ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-14 20:50:08 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									mutex_lock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									down_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-14 20:50:08 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-07-11 16:34:48 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									root  =  idr_get_next ( & cgroup_hierarchy_idr ,  & hierarchy_id ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-04-14 20:50:08 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( root )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgrp  =  task_cgroup_from_root ( task ,  root ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										path  =  cgroup_path ( cgrp ,  buf ,  buflen ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-07-11 16:34:48 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									}  else  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										/* if no hierarchy exists, everyone is in "/" */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( strlcpy ( buf ,  " / " ,  buflen )  <  buflen ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											path  =  buf ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-14 20:50:08 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									up_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-14 20:50:08 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_unlock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  path ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-14 20:50:08 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
									
										
										
										
											2013-07-11 16:34:48 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								EXPORT_SYMBOL_GPL ( task_cgroup_path ) ;  
						 
					
						
							
								
									
										
										
										
											2013-04-14 20:50:08 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks.  The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration.  This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists.  cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor.  cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/* used to track tasks and other necessary states during migration */  
						 
					
						
							
								
									
										
										
										
											2011-12-12 18:12:21 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								struct  cgroup_taskset  {  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks.  The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration.  This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists.  cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor.  cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* the src and dst cset list running through cset->mg_node */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  list_head 	src_csets ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  list_head 	dst_csets ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Fields  for  cgroup_taskset_ * ( )  iteration . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Before  migration  is  committed ,  the  target  migration  tasks  are  on 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  - > mg_tasks  of  the  csets  on  - > src_csets .   After ,  on  - > mg_tasks  of 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  the  csets  on  - > dst_csets .   - > csets  point  to  either  - > src_csets 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  or  - > dst_csets  depending  on  whether  migration  is  committed . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  - > cur_csets  and  - > cur_task  point  to  the  current  task  position 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  during  iteration . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  list_head 	* csets ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  css_set 		* cur_cset ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  task_struct 	* cur_task ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-12-12 18:12:21 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								} ;  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_taskset_first  -  reset  taskset  and  return  the  first  task 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ tset :  taskset  of  interest 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ tset  iteration  is  initialized  and  the  first  task  is  returned . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								struct  task_struct  * cgroup_taskset_first ( struct  cgroup_taskset  * tset )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks.  The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration.  This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists.  cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor.  cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									tset - > cur_cset  =  list_first_entry ( tset - > csets ,  struct  css_set ,  mg_node ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									tset - > cur_task  =  NULL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  cgroup_taskset_next ( tset ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-12-12 18:12:21 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_taskset_next  -  iterate  to  the  next  task  in  taskset 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ tset :  taskset  of  interest 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Return  the  next  task  in  @ tset .   Iteration  must  have  been  initialized 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  with  cgroup_taskset_first ( ) . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								struct  task_struct  * cgroup_taskset_next ( struct  cgroup_taskset  * tset )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks.  The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration.  This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists.  cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor.  cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  css_set  * cset  =  tset - > cur_cset ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  task_struct  * task  =  tset - > cur_task ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-12-12 18:12:21 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks.  The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration.  This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists.  cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor.  cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									while  ( & cset - > mg_node  ! =  tset - > csets )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ! task ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											task  =  list_first_entry ( & cset - > mg_tasks , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
														struct  task_struct ,  cg_list ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										else 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											task  =  list_next_entry ( task ,  cg_list ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-12-12 18:12:21 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks.  The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration.  This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists.  cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor.  cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( & task - > cg_list  ! =  & cset - > mg_tasks )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											tset - > cur_cset  =  cset ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											tset - > cur_task  =  task ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											return  task ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
										
										
											2011-12-12 18:12:21 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks.  The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration.  This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists.  cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor.  cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										cset  =  list_next_entry ( cset ,  mg_node ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										task  =  NULL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
										
										
											2011-12-12 18:12:21 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks.  The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration.  This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists.  cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor.  cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  NULL ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-12-12 18:12:21 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:41 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  cgroup_task_migrate  -  move  a  task  from  one  cgroup  to  another . 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-05 20:08:13 +02:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ old_cgrp :  the  cgroup  @ tsk  is  being  migrated  from 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:41 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ tsk :  the  task  being  migrated 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ new_cset :  the  new  css_set  @ tsk  is  being  attached  to 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:41 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  Must  be  called  with  cgroup_mutex ,  threadgroup  and  css_set_rwsem  locked . 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  cgroup_task_migrate ( struct  cgroup  * old_cgrp ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												struct  task_struct  * tsk , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												struct  css_set  * new_cset ) 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  css_set  * old_cset ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:41 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									lockdep_assert_held ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
									
										
										
										
											2011-12-21 20:18:35 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  We  are  synchronized  through  threadgroup_lock ( )  against  PF_EXITING 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  setting  such  that  we  can ' t  race  against  cgroup_exit ( )  changing  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  css_set  to  init_css_set  and  dropping  the  old  one . 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2011-12-21 20:03:18 +01:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									WARN_ON_ONCE ( tsk - > flags  &  PF_EXITING ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-21 15:52:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									old_cset  =  task_css_set ( tsk ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks.  The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration.  This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists.  cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor.  cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									get_css_set ( new_cset ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									rcu_assign_pointer ( tsk - > cgroups ,  new_cset ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 17:43:21 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Use  move_tail  so  that  cgroup_taskset_first ( )  still  returns  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  leader  after  migration .   This  works  because  cgroup_migrate ( ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  ensures  that  the  dst_cset  of  the  leader  is  the  first  on  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  tset ' s  dst_csets  list . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									list_move_tail ( & tsk - > cg_list ,  & new_cset - > mg_tasks ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  We  just  gained  a  reference  on  old_cset  by  taking  it  from  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  task .  As  trading  it  for  new_cset  is  protected  by  cgroup_mutex , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  we ' re  safe  to  drop  it  here ;  it  will  be  freed  under  RCU . 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									set_bit ( CGRP_RELEASABLE ,  & old_cgrp - > flags ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:41 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									put_css_set_locked ( old_cset ,  false ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2008-02-23 15:24:09 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks.  One problem with this approach is migration
of multiple targets.  It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks().  Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration.  If preparations of
all targets succeed, the whole thing will succeed.  If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets.  cgroup_migrate_prepare()
pins all source and target css_sets and link them up.  Note that this
can be performed without holding threadgroup_lock even if the target
is a process.  As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  cgroup_migrate_finish  -  cleanup  after  attach 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ preloaded_csets :  list  of  preloaded  css_sets 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks.  One problem with this approach is migration
of multiple targets.  It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks().  Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration.  If preparations of
all targets succeed, the whole thing will succeed.  If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets.  cgroup_migrate_prepare()
pins all source and target css_sets and link them up.  Note that this
can be performed without holding threadgroup_lock even if the target
is a process.  As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  Undo  cgroup_migrate_add_src ( )  and  cgroup_migrate_prepare_dst ( ) .   See 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  those  functions  for  details . 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks.  One problem with this approach is migration
of multiple targets.  It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks().  Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration.  If preparations of
all targets succeed, the whole thing will succeed.  If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets.  cgroup_migrate_prepare()
pins all source and target css_sets and link them up.  Note that this
can be performed without holding threadgroup_lock even if the target
is a process.  As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  cgroup_migrate_finish ( struct  list_head  * preloaded_csets )  
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks.  One problem with this approach is migration
of multiple targets.  It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks().  Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration.  If preparations of
all targets succeed, the whole thing will succeed.  If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets.  cgroup_migrate_prepare()
pins all source and target css_sets and link them up.  Note that this
can be performed without holding threadgroup_lock even if the target
is a process.  As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  css_set  * cset ,  * tmp_cset ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks.  One problem with this approach is migration
of multiple targets.  It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks().  Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration.  If preparations of
all targets succeed, the whole thing will succeed.  If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets.  cgroup_migrate_prepare()
pins all source and target css_sets and link them up.  Note that this
can be performed without holding threadgroup_lock even if the target
is a process.  As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									down_write ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									list_for_each_entry_safe ( cset ,  tmp_cset ,  preloaded_csets ,  mg_preload_node )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cset - > mg_src_cgrp  =  NULL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cset - > mg_dst_cset  =  NULL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										list_del_init ( & cset - > mg_preload_node ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										put_css_set_locked ( cset ,  false ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									up_write ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_migrate_add_src  -  add  a  migration  source  css_set 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ src_cset :  the  source  css_set  to  add 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ dst_cgrp :  the  destination  cgroup 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ preloaded_csets :  list  of  preloaded  css_sets 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Tasks  belonging  to  @ src_cset  are  about  to  be  migrated  to  @ dst_cgrp .   Pin 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ src_cset  and  add  it  to  @ preloaded_csets ,  which  should  later  be  cleaned 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  up  by  cgroup_migrate_finish ( ) . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  This  function  may  be  called  without  holding  threadgroup_lock  even  if  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  target  is  a  process .   Threads  may  be  created  and  destroyed  but  as  long 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  as  cgroup_mutex  is  not  dropped ,  no  new  css_set  can  be  put  into  play  and 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  the  preloaded  css_sets  are  guaranteed  to  cover  all  migrations . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  void  cgroup_migrate_add_src ( struct  css_set  * src_cset ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												   struct  cgroup  * dst_cgrp , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												   struct  list_head  * preloaded_csets ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgroup  * src_cgrp ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									lockdep_assert_held ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									src_cgrp  =  cset_cgroup_from_root ( src_cset ,  dst_cgrp - > root ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ! list_empty ( & src_cset - > mg_preload_node ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									WARN_ON ( src_cset - > mg_src_cgrp ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									WARN_ON ( ! list_empty ( & src_cset - > mg_tasks ) ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									WARN_ON ( ! list_empty ( & src_cset - > mg_node ) ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									src_cset - > mg_src_cgrp  =  src_cgrp ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									get_css_set ( src_cset ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									list_add ( & src_cset - > mg_preload_node ,  preloaded_csets ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_migrate_prepare_dst  -  prepare  destination  css_sets  for  migration 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ dst_cgrp :  the  destination  cgroup  ( may  be  % NULL ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks.  One problem with this approach is migration
of multiple targets.  It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks().  Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration.  If preparations of
all targets succeed, the whole thing will succeed.  If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets.  cgroup_migrate_prepare()
pins all source and target css_sets and link them up.  Note that this
can be performed without holding threadgroup_lock even if the target
is a process.  As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ preloaded_csets :  list  of  preloaded  source  css_sets 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Tasks  are  about  to  be  moved  to  @ dst_cgrp  and  all  the  source  css_sets 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  have  been  preloaded  to  @ preloaded_csets .   This  function  looks  up  and 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  pins  all  destination  css_sets ,  links  each  to  its  source ,  and  append  them 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  to  @ preloaded_csets .   If  @ dst_cgrp  is  % NULL ,  the  destination  of  each 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  source  css_set  is  assumed  to  be  its  cgroup  on  the  default  hierarchy . 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks.  One problem with this approach is migration
of multiple targets.  It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks().  Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration.  If preparations of
all targets succeed, the whole thing will succeed.  If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets.  cgroup_migrate_prepare()
pins all source and target css_sets and link them up.  Note that this
can be performed without holding threadgroup_lock even if the target
is a process.  As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  This  function  must  be  called  after  cgroup_migrate_add_src ( )  has  been 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  called  on  each  migration  source  css_set .   After  migration  is  performed 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  using  cgroup_migrate ( ) ,  cgroup_migrate_finish ( )  must  be  called  on 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ preloaded_csets . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  int  cgroup_migrate_prepare_dst ( struct  cgroup  * dst_cgrp ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												      struct  list_head  * preloaded_csets ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									LIST_HEAD ( csets ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  css_set  * src_cset ,  * tmp_cset ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks.  One problem with this approach is migration
of multiple targets.  It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks().  Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration.  If preparations of
all targets succeed, the whole thing will succeed.  If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets.  cgroup_migrate_prepare()
pins all source and target css_sets and link them up.  Note that this
can be performed without holding threadgroup_lock even if the target
is a process.  As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Except  for  the  root ,  child_subsys_mask  must  be  zero  for  a  cgroup 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  with  tasks  so  that  child  cgroups  don ' t  compete  against  tasks . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( dst_cgrp  & &  cgroup_on_dfl ( dst_cgrp )  & &  cgroup_parent ( dst_cgrp )  & & 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									    dst_cgrp - > child_subsys_mask ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  - EBUSY ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks.  One problem with this approach is migration
of multiple targets.  It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks().  Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration.  If preparations of
all targets succeed, the whole thing will succeed.  If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets.  cgroup_migrate_prepare()
pins all source and target css_sets and link them up.  Note that this
can be performed without holding threadgroup_lock even if the target
is a process.  As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* look up the dst cset for each src cset and link it to src */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									list_for_each_entry_safe ( src_cset ,  tmp_cset ,  preloaded_csets ,  mg_preload_node )  { 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks.  One problem with this approach is migration
of multiple targets.  It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks().  Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration.  If preparations of
all targets succeed, the whole thing will succeed.  If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets.  cgroup_migrate_prepare()
pins all source and target css_sets and link them up.  Note that this
can be performed without holding threadgroup_lock even if the target
is a process.  As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										struct  css_set  * dst_cset ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										dst_cset  =  find_css_set ( src_cset , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
													dst_cgrp  ? :  src_cset - > dfl_cgrp ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks.  One problem with this approach is migration
of multiple targets.  It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks().  Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration.  If preparations of
all targets succeed, the whole thing will succeed.  If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets.  cgroup_migrate_prepare()
pins all source and target css_sets and link them up.  Note that this
can be performed without holding threadgroup_lock even if the target
is a process.  As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ! dst_cset ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											goto  err ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										WARN_ON_ONCE ( src_cset - > mg_dst_cset  | |  dst_cset - > mg_dst_cset ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  If  src  cset  equals  dst ,  it ' s  noop .   Drop  the  src . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  cgroup_migrate ( )  will  skip  the  cset  too .   Note  that  we 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  can ' t  handle  src  = =  dst  as  some  nodes  are  used  by  both . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( src_cset  = =  dst_cset )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											src_cset - > mg_src_cgrp  =  NULL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											list_del_init ( & src_cset - > mg_preload_node ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											put_css_set ( src_cset ,  false ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											put_css_set ( dst_cset ,  false ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks.  One problem with this approach is migration
of multiple targets.  It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks().  Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration.  If preparations of
all targets succeed, the whole thing will succeed.  If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets.  cgroup_migrate_prepare()
pins all source and target css_sets and link them up.  Note that this
can be performed without holding threadgroup_lock even if the target
is a process.  As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										src_cset - > mg_dst_cset  =  dst_cset ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( list_empty ( & dst_cset - > mg_preload_node ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											list_add ( & dst_cset - > mg_preload_node ,  & csets ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										else 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											put_css_set ( dst_cset ,  false ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									list_splice_tail ( & csets ,  preloaded_csets ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks.  One problem with this approach is migration
of multiple targets.  It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks().  Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration.  If preparations of
all targets succeed, the whole thing will succeed.  If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets.  cgroup_migrate_prepare()
pins all source and target css_sets and link them up.  Note that this
can be performed without holding threadgroup_lock even if the target
is a process.  As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								err :  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									cgroup_migrate_finish ( & csets ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  - ENOMEM ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_migrate  -  migrate  a  process  or  task  to  a  cgroup 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ cgrp :  the  destination  cgroup 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ leader :  the  leader  of  the  process  or  the  task  to  migrate 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ threadgroup :  whether  @ leader  points  to  the  whole  process  or  a  single  task 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Migrate  a  process  or  task  denoted  by  @ leader  to  @ cgrp .   If  migrating  a 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  process ,  the  caller  must  be  holding  threadgroup_lock  of  @ leader .   The 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  caller  is  also  responsible  for  invoking  cgroup_migrate_add_src ( )  and 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_migrate_prepare_dst ( )  on  the  targets  before  invoking  this 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  function  and  following  up  with  cgroup_migrate_finish ( ) . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  As  long  as  a  controller ' s  - > can_attach ( )  doesn ' t  fail ,  this  function  is 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  guaranteed  to  succeed .   This  means  that ,  excluding  - > can_attach ( ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  failure ,  when  migrating  multiple  targets ,  the  success  or  failure  can  be 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  decided  for  all  targets  by  invoking  group_migrate_prepare_dst ( )  before 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  actually  starting  migrating . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  int  cgroup_migrate ( struct  cgroup  * cgrp ,  struct  task_struct  * leader ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											  bool  threadgroup ) 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks.  The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration.  This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists.  cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor.  cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_taskset  tset  =  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. src_csets 	=  LIST_HEAD_INIT ( tset . src_csets ) , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. dst_csets 	=  LIST_HEAD_INIT ( tset . dst_csets ) , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. csets 		=  & tset . src_csets , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * css ,  * failed_css  =  NULL ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks.  The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration.  This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists.  cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor.  cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  css_set  * cset ,  * tmp_cset ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  task_struct  * task ,  * tmp_task ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									int  i ,  ret ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2012-01-03 21:18:31 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Prevent  freeing  of  tasks  while  we  take  a  snapshot .  Tasks  that  are 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  already  PF_EXITING  could  be  freed  from  underneath  us  unless  we 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  take  an  rcu_read_lock . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks.  The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration.  This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists.  cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor.  cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									down_write ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-01-03 21:18:31 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									rcu_read_lock ( ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:43 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									task  =  leader ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									do  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:43 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										/* @task either already exited or can't exit until the end */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( task - > flags  &  PF_EXITING ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-10-12 10:59:17 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											goto  next ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-12-12 18:12:21 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										/* leave @task alone if post_fork() hasn't linked it yet */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( list_empty ( & task - > cg_list ) ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-10-12 10:59:17 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											goto  next ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-12-12 18:12:21 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks.  The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration.  This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists.  cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor.  cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										cset  =  task_css_set ( task ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks.  One problem with this approach is migration
of multiple targets.  It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks().  Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration.  If preparations of
all targets succeed, the whole thing will succeed.  If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets.  cgroup_migrate_prepare()
pins all source and target css_sets and link them up.  Note that this
can be performed without holding threadgroup_lock even if the target
is a process.  As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ! cset - > mg_src_cgrp ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-10-12 10:59:17 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											goto  next ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks.  The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration.  This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists.  cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor.  cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2012-01-30 12:51:56 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										/*
 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 17:43:21 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										 *  cgroup_taskset_first ( )  must  always  return  the  leader . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  Take  care  to  avoid  disturbing  the  ordering . 
							 
						 
					
						
							
								
									
										
										
										
											2012-01-30 12:51:56 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 17:43:21 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										list_move_tail ( & task - > cg_list ,  & cset - > mg_tasks ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( list_empty ( & cset - > mg_node ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											list_add_tail ( & cset - > mg_node ,  & tset . src_csets ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( list_empty ( & cset - > mg_dst_cset - > mg_node ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											list_move_tail ( & cset - > mg_dst_cset - > mg_node , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												       & tset . dst_csets ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-10-12 10:59:17 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									next : 
							 
						 
					
						
							
								
									
										
										
										
											2013-03-13 09:17:09 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ! threadgroup ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											break ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:43 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									}  while_each_thread ( leader ,  task ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-01-03 21:18:31 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									rcu_read_unlock ( ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks.  The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration.  This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists.  cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor.  cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									up_write ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2011-12-12 18:12:21 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* methods shouldn't be called if no task is actually migrating */ 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks.  The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration.  This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists.  cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor.  cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( list_empty ( & tset . src_csets ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  0 ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-12-12 18:12:21 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks.  One problem with this approach is migration
of multiple targets.  It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks().  Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration.  If preparations of
all targets succeed, the whole thing will succeed.  If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets.  cgroup_migrate_prepare()
pins all source and target css_sets and link them up.  Note that this
can be performed without holding threadgroup_lock even if the target
is a process.  As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* check that we can legitimately attach to the cgroup */ 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce effective cgroup_subsys_state
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css.  This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups.  compare_css_sets()
already compares both, not for correctness but for optimization.  As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for_each_e_css ( css ,  i ,  cgrp )  { 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( css - > ss - > can_attach )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:43 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											ret  =  css - > ss - > can_attach ( css ,  & tset ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( ret )  { 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												failed_css  =  css ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												goto  out_cancel_attach ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks.  One problem with this approach is migration
of multiple targets.  It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks().  Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration.  If preparations of
all targets succeed, the whole thing will succeed.  If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets.  cgroup_migrate_prepare()
pins all source and target css_sets and link them up.  Note that this
can be performed without holding threadgroup_lock even if the target
is a process.  As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  Now  that  we ' re  guaranteed  success ,  proceed  to  move  all  tasks  to 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  the  new  cgroup .   There  are  no  failure  cases  after  here ,  so  this 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  is  the  commit  point . 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:41 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									down_write ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks.  The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration.  This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists.  cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor.  cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									list_for_each_entry ( cset ,  & tset . src_csets ,  mg_node )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										list_for_each_entry_safe ( task ,  tmp_task ,  & cset - > mg_tasks ,  cg_list ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											cgroup_task_migrate ( cset - > mg_src_cgrp ,  task , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
													    cset - > mg_dst_cset ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:41 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									up_write ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks.  One problem with this approach is migration
of multiple targets.  It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks().  Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration.  If preparations of
all targets succeed, the whole thing will succeed.  If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets.  cgroup_migrate_prepare()
pins all source and target css_sets and link them up.  Note that this
can be performed without holding threadgroup_lock even if the target
is a process.  As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  Migration  is  committed ,  all  target  tasks  are  now  on  dst_csets . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Nothing  is  sensitive  to  fork ( )  after  this  point .   Notify 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  controllers  that  migration  is  complete . 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks.  One problem with this approach is migration
of multiple targets.  It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks().  Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration.  If preparations of
all targets succeed, the whole thing will succeed.  If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets.  cgroup_migrate_prepare()
pins all source and target css_sets and link them up.  Note that this
can be performed without holding threadgroup_lock even if the target
is a process.  As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									tset . csets  =  & tset . dst_csets ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce effective cgroup_subsys_state
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css.  This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups.  compare_css_sets()
already compares both, not for correctness but for optimization.  As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for_each_e_css ( css ,  i ,  cgrp ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( css - > ss - > attach ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											css - > ss - > attach ( css ,  & tset ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:43 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									ret  =  0 ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks.  The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration.  This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists.  cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor.  cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									goto  out_release_tset ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								out_cancel_attach :  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce effective cgroup_subsys_state
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css.  This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups.  compare_css_sets()
already compares both, not for correctness but for optimization.  As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for_each_e_css ( css ,  i ,  cgrp )  { 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks.  The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration.  This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists.  cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor.  cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( css  = =  failed_css ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											break ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( css - > ss - > cancel_attach ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											css - > ss - > cancel_attach ( css ,  & tset ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks.  The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration.  This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists.  cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor.  cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								out_release_tset :  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									down_write ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									list_splice_init ( & tset . dst_csets ,  & tset . src_csets ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									list_for_each_entry_safe ( cset ,  tmp_cset ,  & tset . src_csets ,  mg_node )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 17:43:21 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										list_splice_tail_init ( & cset - > mg_tasks ,  & cset - > tasks ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks.  The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration.  This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists.  cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor.  cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										list_del_init ( & cset - > mg_node ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									up_write ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:43 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  ret ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks.  One problem with this approach is migration
of multiple targets.  It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks().  Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration.  If preparations of
all targets succeed, the whole thing will succeed.  If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets.  cgroup_migrate_prepare()
pins all source and target css_sets and link them up.  Note that this
can be performed without holding threadgroup_lock even if the target
is a process.  As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_attach_task  -  attach  a  task  or  a  whole  threadgroup  to  a  cgroup 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ dst_cgrp :  the  cgroup  to  attach  to 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ leader :  the  task  or  the  leader  of  the  threadgroup  to  be  attached 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ threadgroup :  attach  the  whole  threadgroup ? 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  Call  holding  cgroup_mutex  and  threadgroup_lock  of  @ leader . 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks.  One problem with this approach is migration
of multiple targets.  It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks().  Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration.  If preparations of
all targets succeed, the whole thing will succeed.  If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets.  cgroup_migrate_prepare()
pins all source and target css_sets and link them up.  Note that this
can be performed without holding threadgroup_lock even if the target
is a process.  As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  int  cgroup_attach_task ( struct  cgroup  * dst_cgrp ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											      struct  task_struct  * leader ,  bool  threadgroup ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									LIST_HEAD ( preloaded_csets ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  task_struct  * task ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									int  ret ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* look up all src csets */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									down_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									rcu_read_lock ( ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									task  =  leader ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									do  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgroup_migrate_add_src ( task_css_set ( task ) ,  dst_cgrp , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												       & preloaded_csets ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ! threadgroup ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											break ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									}  while_each_thread ( leader ,  task ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									rcu_read_unlock ( ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									up_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* prepare dst csets and commit */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									ret  =  cgroup_migrate_prepare_dst ( dst_cgrp ,  & preloaded_csets ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ! ret ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										ret  =  cgroup_migrate ( dst_cgrp ,  leader ,  threadgroup ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									cgroup_migrate_finish ( & preloaded_csets ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  ret ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Find  the  task_struct  of  the  task  to  attach  by  vpid  and  pass  it  along  to  the 
							 
						 
					
						
							
								
									
										
										
										
											2011-12-12 18:12:21 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  function  to  attach  either  it  or  all  tasks  in  its  threadgroup .  Will  lock 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  cgroup_mutex  and  threadgroup . 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:32 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  ssize_t  __cgroup_procs_write ( struct  kernfs_open_file  * of ,  char  * buf ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												    size_t  nbytes ,  loff_t  off ,  bool  threadgroup ) 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:32 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  task_struct  * tsk ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-11-14 10:39:19 +11:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									const  struct  cred  * cred  =  current_cred ( ) ,  * tcred ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * cgrp ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									pid_t  pid ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:32 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  ret ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( kstrtoint ( strstrip ( buf ) ,  0 ,  & pid )  | |  pid  <  0 ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  - EINVAL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgrp  =  cgroup_kn_lock_live ( of - > kn ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ! cgrp ) 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  - ENODEV ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2012-01-03 21:18:30 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								retry_find_task :  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									rcu_read_lock ( ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:32 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( pid )  { 
							 
						 
					
						
							
								
									
										
										
										
											2008-02-07 00:14:47 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										tsk  =  find_task_by_vpid ( pid ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ! tsk )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											rcu_read_unlock ( ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-01-18 16:56:47 +09:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											ret  =  - ESRCH ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-01-03 21:18:30 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											goto  out_unlock_cgroup ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:32 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  even  if  we ' re  attaching  all  tasks  in  the  thread  group ,  we 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  only  need  to  check  permissions  on  one  of  them . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 */ 
							 
						 
					
						
							
								
									
										
										
										
											2008-11-14 10:39:19 +11:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										tcred  =  __task_cred ( tsk ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-03-12 15:44:39 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ! uid_eq ( cred - > euid ,  GLOBAL_ROOT_UID )  & & 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										    ! uid_eq ( cred - > euid ,  tcred - > uid )  & & 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										    ! uid_eq ( cred - > euid ,  tcred - > suid ) )  { 
							 
						 
					
						
							
								
									
										
										
										
											2008-11-14 10:39:19 +11:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											rcu_read_unlock ( ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-01-03 21:18:30 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											ret  =  - EACCES ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											goto  out_unlock_cgroup ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:32 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
										
										
											2012-01-03 21:18:30 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									}  else 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										tsk  =  current ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-12-12 18:12:21 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( threadgroup ) 
							 
						 
					
						
							
								
									
										
										
										
											2012-01-03 21:18:30 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										tsk  =  tsk - > group_leader ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-21 09:13:46 +02:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
									
										
										
										
											2013-03-19 13:45:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  Workqueue  threads  may  acquire  PF_NO_SETAFFINITY  and  become 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-21 09:13:46 +02:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  trapped  in  a  cpuset ,  or  RT  worker  may  be  born  in  a  cgroup 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  with  no  rt_runtime  allocated .   Just  say  no . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-03-19 13:45:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( tsk  = =  kthreadd_task  | |  ( tsk - > flags  &  PF_NO_SETAFFINITY ) )  { 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-21 09:13:46 +02:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										ret  =  - EINVAL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										rcu_read_unlock ( ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										goto  out_unlock_cgroup ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2012-01-03 21:18:30 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									get_task_struct ( tsk ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									rcu_read_unlock ( ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									threadgroup_lock ( tsk ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( threadgroup )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ! thread_group_leader ( tsk ) )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 *  a  race  with  de_thread  from  another  thread ' s  exec ( ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 *  may  strip  us  of  our  leadership ,  if  this  happens , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 *  there  is  no  choice  but  to  throw  this  task  away  and 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 *  try  again ;  this  is 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 *  " double-double-toil-and-trouble-check locking " . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											threadgroup_unlock ( tsk ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											put_task_struct ( tsk ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											goto  retry_find_task ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
										
										
											2013-03-13 09:17:09 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									ret  =  cgroup_attach_task ( cgrp ,  tsk ,  threadgroup ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2011-12-12 18:12:21 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									threadgroup_unlock ( tsk ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:32 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									put_task_struct ( tsk ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-01-03 21:18:30 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								out_unlock_cgroup :  
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgroup_kn_unlock ( of - > kn ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  ret  ? :  nbytes ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:32 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-04-07 09:29:51 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_attach_task_all  -  attach  task  ' tsk '  to  all  cgroups  of  task  ' from ' 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ from :  attach  to  all  cgroups  of  a  given  task 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ tsk :  the  task  to  be  attached 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								int  cgroup_attach_task_all ( struct  task_struct  * from ,  struct  task_struct  * tsk )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_root  * root ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-07 09:29:51 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  retval  =  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-04-07 09:29:51 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_lock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:53 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for_each_root ( root )  { 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										struct  cgroup  * from_cgrp ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( root  = =  & cgrp_dfl_root ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:53 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										down_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										from_cgrp  =  task_cgroup_from_root ( from ,  root ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										up_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-07 09:29:51 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-07-31 16:18:36 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										retval  =  cgroup_attach_task ( from_cgrp ,  tsk ,  false ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-07 09:29:51 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( retval ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											break ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-07 09:29:51 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_unlock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-07 09:29:51 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  retval ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								EXPORT_SYMBOL_GPL ( cgroup_attach_task_all ) ;  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  ssize_t  cgroup_tasks_write ( struct  kernfs_open_file  * of ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												  char  * buf ,  size_t  nbytes ,  loff_t  off ) 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  __cgroup_procs_write ( of ,  buf ,  nbytes ,  off ,  false ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  ssize_t  cgroup_procs_write ( struct  kernfs_open_file  * of ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												  char  * buf ,  size_t  nbytes ,  loff_t  off ) 
							 
						 
					
						
							
								
									
										
										
										
											2008-07-25 01:47:01 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  __cgroup_procs_write ( of ,  buf ,  nbytes ,  off ,  true ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-07-25 01:47:01 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:21 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  ssize_t  cgroup_release_agent_write ( struct  kernfs_open_file  * of ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
													  char  * buf ,  size_t  nbytes ,  loff_t  off ) 
							 
						 
					
						
							
								
									
										
										
										
											2008-07-25 01:46:59 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * cgrp ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									BUILD_BUG_ON ( sizeof ( cgrp - > root - > release_agent_path )  <  PATH_MAX ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgrp  =  cgroup_kn_lock_live ( of - > kn ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ! cgrp ) 
							 
						 
					
						
							
								
									
										
										
										
											2008-07-25 01:46:59 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  - ENODEV ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									spin_lock ( & release_agent_path_lock ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									strlcpy ( cgrp - > root - > release_agent_path ,  strstrip ( buf ) , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										sizeof ( cgrp - > root - > release_agent_path ) ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									spin_unlock ( & release_agent_path_lock ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgroup_kn_unlock ( of - > kn ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:21 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  nbytes ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-07-25 01:46:59 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_release_agent_show ( struct  seq_file  * seq ,  void  * v )  
						 
					
						
							
								
									
										
										
										
											2008-07-25 01:46:59 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * cgrp  =  seq_css ( seq ) - > cgroup ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:24 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:11:00 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									spin_lock ( & release_agent_path_lock ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-07-25 01:46:59 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									seq_puts ( seq ,  cgrp - > root - > release_agent_path ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:11:00 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									spin_unlock ( & release_agent_path_lock ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-07-25 01:46:59 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									seq_putc ( seq ,  ' \n ' ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_sane_behavior_show ( struct  seq_file  * seq ,  void  * v )  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce sane_behavior mount option
It's a sad fact that at this point various cgroup controllers are
carrying so many idiosyncrasies and pure insanities that it simply
isn't possible to reach any sort of sane consistent behavior while
maintaining staying fully compatible with what already has been
exposed to userland.
As we can't break exposed userland interface, transitioning to sane
behaviors can only be done in steps while maintaining backwards
compatibility.  This patch introduces a new mount option -
__DEVEL__sane_behavior - which disables crazy features and enforces
consistent behaviors in cgroup core proper and various controllers.
As exactly which behaviors it changes are still being determined, the
mount option, at this point, is useful only for development of the new
behaviors.  As such, the mount option is prefixed with __DEVEL__ and
generates a warning message when used.
Eventually, once we get to the point where all controller's behaviors
are consistent enough to implement unified hierarchy, the __DEVEL__
prefix will be dropped, and more importantly, unified-hierarchy will
enforce sane_behavior by default.  Maybe we'll able to completely drop
the crazy stuff after a while, maybe not, but we at least have a
strategy to move on to saner behaviors.
This patch introduces the mount option and changes the following
behaviors in cgroup core.
* Mount options "noprefix" and "clone_children" are disallowed.  Also,
  cgroupfs file cgroup.clone_children is not created.
* When mounting an existing superblock, mount options should match.
  This is currently pretty crazy.  If one mounts a cgroup, creates a
  subdirectory, unmounts it and then mount it again with different
  option, it looks like the new options are applied but they aren't.
* Remount is disallowed.
The behaviors changes are documented in the comment above
CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
controllers are converted and planned improvements progress.
v2: Dropped unnecessary explicit file permission setting sane_behavior
    cftype entry as suggested by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
											 
										 
										
											2013-04-14 20:15:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * cgrp  =  seq_css ( seq ) - > cgroup ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									seq_printf ( seq ,  " %d \n " ,  cgroup_sane_behavior ( cgrp ) ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-07-25 01:46:59 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  cgroup_print_ss_mask ( struct  seq_file  * seq ,  unsigned  int  ss_mask )  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:33 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys  * ss ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									bool  printed  =  false ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									int  ssid ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for_each_subsys ( ss ,  ssid )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ss_mask  &  ( 1  < <  ssid ) )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( printed ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												seq_putc ( seq ,  '   ' ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											seq_printf ( seq ,  " %s " ,  ss - > name ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											printed  =  true ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
										
										
											2008-04-29 01:00:06 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( printed ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										seq_putc ( seq ,  ' \n ' ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:33 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/* show controllers which are currently attached to the default hierarchy */  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  int  cgroup_root_controllers_show ( struct  seq_file  * seq ,  void  * v )  
						 
					
						
							
								
									
										
										
										
											2008-07-25 01:46:58 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * cgrp  =  seq_css ( seq ) - > cgroup ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 19:33:07 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgroup_print_ss_mask ( seq ,  cgrp - > root - > subsys_mask  & 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											     ~ cgrp_dfl_root_inhibit_ss_mask ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-07-25 01:46:58 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/* show controllers which are enabled from the parent */  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  int  cgroup_controllers_show ( struct  seq_file  * seq ,  void  * v )  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * cgrp  =  seq_css ( seq ) - > cgroup ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgroup_print_ss_mask ( seq ,  cgroup_parent ( cgrp ) - > child_subsys_mask ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/* show controllers which are enabled for a given cgroup's children */  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  int  cgroup_subtree_control_show ( struct  seq_file  * seq ,  void  * v )  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * cgrp  =  seq_css ( seq ) - > cgroup ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									cgroup_print_ss_mask ( seq ,  cgrp - > child_subsys_mask ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_update_dfl_csses  -  update  css  assoc  of  a  subtree  in  default  hierarchy 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ cgrp :  root  of  the  subtree  to  update  csses  for 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ cgrp ' s  child_subsys_mask  has  changed  and  its  subtree ' s  ( self  excluded ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  css  associations  need  to  be  updated  accordingly .   This  function  looks  up 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  all  css_sets  which  are  attached  to  the  subtree ,  creates  the  matching 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  updated  css_sets  and  migrates  the  tasks  to  the  new  ones . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  int  cgroup_update_dfl_csses ( struct  cgroup  * cgrp )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									LIST_HEAD ( preloaded_csets ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * css ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  css_set  * src_cset ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									int  ret ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* look up all csses currently attached to @cgrp's subtree */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									down_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									css_for_each_descendant_pre ( css ,  cgroup_css ( cgrp ,  NULL ) )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										struct  cgrp_cset_link  * link ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										/* self is not affected by child_subsys_mask change */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( css - > cgroup  = =  cgrp ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										list_for_each_entry ( link ,  & css - > cgroup - > cset_links ,  cset_link ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											cgroup_migrate_add_src ( link - > cset ,  cgrp , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
													       & preloaded_csets ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									up_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* NULL dst indicates self on default hierarchy */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									ret  =  cgroup_migrate_prepare_dst ( NULL ,  & preloaded_csets ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ret ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										goto  out_finish ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									list_for_each_entry ( src_cset ,  & preloaded_csets ,  mg_preload_node )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										struct  task_struct  * last_task  =  NULL ,  * task ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										/* src_csets precede dst_csets, break on the first dst_cset */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ! src_cset - > mg_src_cgrp ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											break ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  All  tasks  in  src_cset  need  to  be  migrated  to  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  matching  dst_cset .   Empty  it  process  by  process .   We 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  walk  tasks  but  migrate  processes .   The  leader  might  even 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  belong  to  a  different  cset  but  such  src_cset  would  also 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  be  among  the  target  src_csets  because  the  default 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  hierarchy  enforces  per - process  membership . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										while  ( true )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											down_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											task  =  list_first_entry_or_null ( & src_cset - > tasks , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
														struct  task_struct ,  cg_list ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( task )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												task  =  task - > group_leader ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												WARN_ON_ONCE ( ! task_css_set ( task ) - > mg_src_cgrp ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												get_task_struct ( task ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											up_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( ! task ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												break ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											/* guard against possible infinite loop */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( WARN ( last_task  = =  task , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												 " cgroup: update_dfl_csses failed to make progress, aborting in inconsistent state \n " ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												goto  out_finish ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											last_task  =  task ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											threadgroup_lock ( task ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											/* raced against de_thread() from another thread? */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( ! thread_group_leader ( task ) )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												threadgroup_unlock ( task ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												put_task_struct ( task ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											ret  =  cgroup_migrate ( src_cset - > dfl_cgrp ,  task ,  true ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											threadgroup_unlock ( task ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											put_task_struct ( task ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( WARN ( ret ,  " cgroup: failed to update controllers for the default hierarchy (%d), further operations may crash or hang \n " ,  ret ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												goto  out_finish ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								out_finish :  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									cgroup_migrate_finish ( & preloaded_csets ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  ret ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								/* change the enabled child controllers for a cgroup in the default hierarchy */  
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:21 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  ssize_t  cgroup_subtree_control_write ( struct  kernfs_open_file  * of ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
													    char  * buf ,  size_t  nbytes , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
													    loff_t  off ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:11:00 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									unsigned  int  enable  =  0 ,  disable  =  0 ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * cgrp ,  * child ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys  * ss ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:21 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									char  * tok ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  ssid ,  ret ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:10:59 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  Parse  input  -  space  separated  list  of  subsystem  names  prefixed 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  with  either  +  or  - . 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:21 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									buf  =  strstrip ( buf ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									while  ( ( tok  =  strsep ( & buf ,  "   " ) ) )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:10:59 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( tok [ 0 ]  = =  ' \0 ' ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										for_each_subsys ( ss ,  ssid )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 19:33:07 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											if  ( ss - > disabled  | |  strcmp ( tok  +  1 ,  ss - > name )  | | 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											    ( ( 1  < <  ss - > id )  &  cgrp_dfl_root_inhibit_ss_mask ) ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( * tok  = =  ' + ' )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:11:00 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												enable  | =  1  < <  ssid ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												disable  & =  ~ ( 1  < <  ssid ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											}  else  if  ( * tok  = =  ' - ' )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:11:00 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												disable  | =  1  < <  ssid ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												enable  & =  ~ ( 1  < <  ssid ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											}  else  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												return  - EINVAL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											break ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ssid  = =  CGROUP_SUBSYS_COUNT ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											return  - EINVAL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgrp  =  cgroup_kn_lock_live ( of - > kn ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ! cgrp ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  - ENODEV ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									for_each_subsys ( ss ,  ssid )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( enable  &  ( 1  < <  ssid ) )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( cgrp - > child_subsys_mask  &  ( 1  < <  ssid ) )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												enable  & =  ~ ( 1  < <  ssid ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 *  Because  css  offlining  is  asynchronous ,  userland 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 *  might  try  to  re - enable  the  same  controller  while 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 *  the  previous  instance  is  still  around .   In  such 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 *  cases ,  wait  till  it ' s  gone  using  offline_waitq . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											cgroup_for_each_live_child ( child ,  cgrp )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:10:59 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												DEFINE_WAIT ( wait ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												if  ( ! cgroup_css ( child ,  ss ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
													continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:10:59 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												cgroup_get ( child ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												prepare_to_wait ( & child - > offline_waitq ,  & wait , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
														TASK_UNINTERRUPTIBLE ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												cgroup_kn_unlock ( of - > kn ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												schedule ( ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												finish_wait ( & child - > offline_waitq ,  & wait ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:10:59 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												cgroup_put ( child ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:11:00 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												return  restart_syscall ( ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											/* unavailable or not enabled on the parent? */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( ! ( cgrp_dfl_root . subsys_mask  &  ( 1  < <  ssid ) )  | | 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											    ( cgroup_parent ( cgrp )  & & 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											     ! ( cgroup_parent ( cgrp ) - > child_subsys_mask  &  ( 1  < <  ssid ) ) ) )  { 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												ret  =  - ENOENT ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												goto  out_unlock ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										}  else  if  ( disable  &  ( 1  < <  ssid ) )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( ! ( cgrp - > child_subsys_mask  &  ( 1  < <  ssid ) ) )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												disable  & =  ~ ( 1  < <  ssid ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											/* a child has it enabled? */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											cgroup_for_each_live_child ( child ,  cgrp )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												if  ( child - > child_subsys_mask  &  ( 1  < <  ssid ) )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
													ret  =  - EBUSY ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
													goto  out_unlock ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ! enable  & &  ! disable )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										ret  =  0 ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										goto  out_unlock ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Except  for  the  root ,  child_subsys_mask  must  be  zero  for  a  cgroup 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  with  tasks  so  that  child  cgroups  don ' t  compete  against  tasks . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( enable  & &  cgroup_parent ( cgrp )  & &  ! list_empty ( & cgrp - > cset_links ) )  { 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										ret  =  - EBUSY ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										goto  out_unlock ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Create  csses  for  enables  and  update  child_subsys_mask .   This 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  changes  cgroup_e_css ( )  results  which  in  turn  makes  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  subsequent  cgroup_update_dfl_csses ( )  associate  all  tasks  in  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  subtree  to  the  updated  csses . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									for_each_subsys ( ss ,  ssid )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ! ( enable  &  ( 1  < <  ssid ) ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgroup_for_each_live_child ( child ,  cgrp )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											ret  =  create_css ( child ,  ss ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( ret ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												goto  err_undo_css ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									cgrp - > child_subsys_mask  | =  enable ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									cgrp - > child_subsys_mask  & =  ~ disable ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									ret  =  cgroup_update_dfl_csses ( cgrp ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ret ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										goto  err_undo_css ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* all tasks are now migrated away from the old csses, kill them */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									for_each_subsys ( ss ,  ssid )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ! ( disable  &  ( 1  < <  ssid ) ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgroup_for_each_live_child ( child ,  cgrp ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											kill_css ( cgroup_css ( child ,  ss ) ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									kernfs_activate ( cgrp - > kn ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									ret  =  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								out_unlock :  
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgroup_kn_unlock ( of - > kn ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:21 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  ret  ? :  nbytes ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								err_undo_css :  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									cgrp - > child_subsys_mask  & =  ~ enable ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									cgrp - > child_subsys_mask  | =  disable ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									for_each_subsys ( ss ,  ssid )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ! ( enable  &  ( 1  < <  ssid ) ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgroup_for_each_live_child ( child ,  cgrp )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											struct  cgroup_subsys_state  * css  =  cgroup_css ( child ,  ss ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( css ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												kill_css ( css ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									goto  out_unlock ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-04-25 18:28:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_populated_show ( struct  seq_file  * seq ,  void  * v )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									seq_printf ( seq ,  " %d \n " ,  ( bool ) seq_css ( seq ) - > cgroup - > populated_cnt ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  ssize_t  cgroup_file_write ( struct  kernfs_open_file  * of ,  char  * buf ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												 size_t  nbytes ,  loff_t  off ) 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:33 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * cgrp  =  of - > kn - > parent - > priv ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cftype  * cft  =  of - > kn - > priv ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * css ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  ret ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:33 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:21 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( cft - > write ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  cft - > write ( of ,  buf ,  nbytes ,  off ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  kernfs  guarantees  that  a  file  isn ' t  deleted  with  operations  in 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  flight ,  which  means  that  the  matching  css  is  and  stays  alive  and 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  doesn ' t  need  to  be  pinned .   The  RCU  locking  is  not  necessary 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  either .   It ' s  just  for  the  convenience  of  using  cgroup_css ( ) . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									rcu_read_lock ( ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									css  =  cgroup_css ( cgrp ,  cft - > ss ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									rcu_read_unlock ( ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:21 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( cft - > write_u64 )  { 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										unsigned  long  long  v ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										ret  =  kstrtoull ( buf ,  0 ,  & v ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ! ret ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											ret  =  cft - > write_u64 ( css ,  cft ,  v ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									}  else  if  ( cft - > write_s64 )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										long  long  v ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										ret  =  kstrtoll ( buf ,  0 ,  & v ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ! ret ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											ret  =  cft - > write_s64 ( css ,  cft ,  v ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-04-29 01:00:06 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									}  else  { 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										ret  =  - EINVAL ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-04-29 01:00:06 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  ret  ? :  nbytes ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:33 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  * cgroup_seqfile_start ( struct  seq_file  * seq ,  loff_t  * ppos )  
						 
					
						
							
								
									
										
										
										
											2008-07-25 01:46:58 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  seq_cft ( seq ) - > seq_start ( seq ,  ppos ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-07-25 01:46:58 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  * cgroup_seqfile_next ( struct  seq_file  * seq ,  void  * v ,  loff_t  * ppos )  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  seq_cft ( seq ) - > seq_next ( seq ,  v ,  ppos ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  cgroup_seqfile_stop ( struct  seq_file  * seq ,  void  * v )  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									seq_cft ( seq ) - > seq_stop ( seq ,  v ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2008-04-29 01:00:01 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_seqfile_show ( struct  seq_file  * m ,  void  * arg )  
						 
					
						
							
								
									
										
										
										
											2008-04-29 01:00:06 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cftype  * cft  =  seq_cft ( m ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * css  =  seq_css ( m ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-04-29 01:00:06 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( cft - > seq_show ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  cft - > seq_show ( m ,  arg ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-04-29 01:00:06 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2008-04-29 00:59:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( cft - > read_u64 ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										seq_printf ( m ,  " %llu \n " ,  cft - > read_u64 ( css ,  cft ) ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									else  if  ( cft - > read_s64 ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										seq_printf ( m ,  " %lld \n " ,  cft - > read_s64 ( css ,  cft ) ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									else 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  - EINVAL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-04-29 01:00:01 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  struct  kernfs_ops  cgroup_kf_single_ops  =  {  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									. atomic_write_len 	=  PAGE_SIZE , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									. write 			=  cgroup_file_write , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									. seq_show 		=  cgroup_seqfile_show , 
							 
						 
					
						
							
								
									
										
										
										
											2008-04-29 01:00:01 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								} ;  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  struct  kernfs_ops  cgroup_kf_ops  =  {  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									. atomic_write_len 	=  PAGE_SIZE , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									. write 			=  cgroup_file_write , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									. seq_start 		=  cgroup_seqfile_start , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									. seq_next 		=  cgroup_seqfile_next , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									. seq_stop 		=  cgroup_seqfile_stop , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									. seq_show 		=  cgroup_seqfile_show , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								} ;  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_rename  -  Only  allow  simple  rename  of  directories  in  place . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_rename ( struct  kernfs_node  * kn ,  struct  kernfs_node  * new_parent ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 const  char  * new_name_str ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * cgrp  =  kn - > priv ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-03-01 15:01:56 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  ret ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( kernfs_type ( kn )  ! =  KERNFS_DIR ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
										return  - ENOTDIR ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( kn - > parent  ! =  new_parent ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
										return  - EIO ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-03-01 15:01:56 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-14 11:18:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  This  isn ' t  a  proper  migration  and  its  usefulness  is  very 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  limited .   Disallow  if  sane_behavior . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( cgroup_sane_behavior ( cgrp ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  - EPERM ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-04-02 16:57:29 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-20 11:10:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  We ' re  gonna  grab  cgroup_mutex  which  nests  outside  kernfs 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-20 11:10:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  active_ref .   kernfs_rename ( )  doesn ' t  require  active_ref 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  protection .   Break  them  before  grabbing  cgroup_mutex . 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-20 11:10:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									kernfs_break_active_protection ( new_parent ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									kernfs_break_active_protection ( kn ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-04-02 16:57:29 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_lock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-04-02 16:57:29 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									ret  =  kernfs_rename ( kn ,  new_parent ,  new_name_str ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-04-02 16:57:29 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_unlock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-20 11:10:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									kernfs_unbreak_active_protection ( kn ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									kernfs_unbreak_active_protection ( new_parent ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  ret ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-04-02 16:57:29 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-04-07 16:44:47 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/* set uid and gid of cgroup dirs and files to that of the creator */  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  int  cgroup_kn_set_ugid ( struct  kernfs_node  * kn )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  iattr  iattr  =  {  . ia_valid  =  ATTR_UID  |  ATTR_GID , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											       . ia_uid  =  current_fsuid ( ) , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											       . ia_gid  =  current_fsgid ( ) ,  } ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( uid_eq ( iattr . ia_uid ,  GLOBAL_ROOT_UID )  & & 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									    gid_eq ( iattr . ia_gid ,  GLOBAL_ROOT_GID ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  kernfs_setattr ( kn ,  & iattr ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_add_file ( struct  cgroup  * cgrp ,  struct  cftype  * cft )  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									char  name [ CGROUP_FILE_NAME_MAX ] ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  kernfs_node  * kn ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  lock_class_key  * key  =  NULL ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-07 16:44:47 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  ret ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# ifdef CONFIG_DEBUG_LOCK_ALLOC 
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									key  =  & cft - > lockdep_key ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# endif 
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									kn  =  __kernfs_create_file ( cgrp - > kn ,  cgroup_file_name ( cgrp ,  cft ,  name ) , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												  cgroup_file_mode ( cft ) ,  0 ,  cft - > kf_ops ,  cft , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												  NULL ,  false ,  key ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-07 16:44:47 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( IS_ERR ( kn ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  PTR_ERR ( kn ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									ret  =  cgroup_kn_set_ugid ( kn ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ret )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-07 16:44:47 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										kernfs_remove ( kn ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  ret ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( cft - > seq_show  = =  cgroup_populated_show ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-25 18:28:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										cgrp - > populated_kn  =  kn ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-28 16:24:10 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_addrm_files  -  add  or  remove  files  to  a  cgroup  directory 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ cgrp :  the  target  cgroup 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ cfts :  array  of  cftypes  to  be  added 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ is_add :  whether  to  add  or  remove 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Depending  on  @ is_add ,  add  or  remove  files  defined  by  @ cfts  on  @ cgrp . 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  For  removals ,  this  function  never  fails .   If  addition  fails ,  this 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  function  doesn ' t  remove  files  already  added .   The  caller  is  responsible 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  for  cleaning  up . 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-28 16:24:10 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_addrm_files ( struct  cgroup  * cgrp ,  struct  cftype  cfts [ ] ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											      bool  is_add ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add xattr support
This is one of the items in the plumber's wish list.
For use cases:
>> What would the use case be for this?
>
> Attaching meta information to services, in an easily discoverable
> way. For example, in systemd we create one cgroup for each service, and
> could then store data like the main pid of the specific service as an
> xattr on the cgroup itself. That way we'd have almost all service state
> in the cgroupfs, which would make it possible to terminate systemd and
> later restart it without losing any state information. But there's more:
> for example, some very peculiar services cannot be terminated on
> shutdown (i.e. fakeraid DM stuff) and it would be really nice if the
> services in question could just mark that on their cgroup, by setting an
> xattr. On the more desktopy side of things there are other
> possibilities: for example there are plans defining what an application
> is along the lines of a cgroup (i.e. an app being a collection of
> processes). With xattrs one could then attach an icon or human readable
> program name on the cgroup.
>
> The key idea is that this would allow attaching runtime meta information
> to cgroups and everything they model (services, apps, vms), that doesn't
> need any complex userspace infrastructure, has good access control
> (i.e. because the file system enforces that anyway, and there's the
> "trusted." xattr namespace), notifications (inotify), and can easily be
> shared among applications.
>
> Lennart
v7:
- no changes
v6:
- remove user xattr namespace, only allow trusted and security
v5:
- check for capabilities before setting/removing xattrs
v4:
- no changes
v3:
- instead of config option, use mount option to enable xattr support
Original-patch-by: Li Zefan <lizefan@huawei.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lennart Poettering <lpoetter@redhat.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
											 
										 
										
											2012-08-23 16:53:30 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cftype  * cft ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-28 16:24:10 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  ret ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									for  ( cft  =  cfts ;  cft - > name [ 0 ]  ! =  ' \0 ' ;  cft + + )  { 
							 
						 
					
						
							
								
									
										
										
										
											2012-12-06 14:38:57 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										/* does cft->flags tell us to skip this file on @cgrp? */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:55 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ( cft - > flags  &  CFTYPE_ONLY_ON_DFL )  & &  ! cgroup_on_dfl ( cgrp ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce sane_behavior mount option
It's a sad fact that at this point various cgroup controllers are
carrying so many idiosyncrasies and pure insanities that it simply
isn't possible to reach any sort of sane consistent behavior while
maintaining staying fully compatible with what already has been
exposed to userland.
As we can't break exposed userland interface, transitioning to sane
behaviors can only be done in steps while maintaining backwards
compatibility.  This patch introduces a new mount option -
__DEVEL__sane_behavior - which disables crazy features and enforces
consistent behaviors in cgroup core proper and various controllers.
As exactly which behaviors it changes are still being determined, the
mount option, at this point, is useful only for development of the new
behaviors.  As such, the mount option is prefixed with __DEVEL__ and
generates a warning message when used.
Eventually, once we get to the point where all controller's behaviors
are consistent enough to implement unified hierarchy, the __DEVEL__
prefix will be dropped, and more importantly, unified-hierarchy will
enforce sane_behavior by default.  Maybe we'll able to completely drop
the crazy stuff after a while, maybe not, but we at least have a
strategy to move on to saner behaviors.
This patch introduces the mount option and changes the following
behaviors in cgroup core.
* Mount options "noprefix" and "clone_children" are disallowed.  Also,
  cgroupfs file cgroup.clone_children is not created.
* When mounting an existing superblock, mount options should match.
  This is currently pretty crazy.  If one mounts a cgroup, creates a
  subdirectory, unmounts it and then mount it again with different
  option, it looks like the new options are applied but they aren't.
* Remount is disallowed.
The behaviors changes are documented in the comment above
CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
controllers are converted and planned improvements progress.
v2: Dropped unnecessary explicit file permission setting sane_behavior
    cftype entry as suggested by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
											 
										 
										
											2013-04-14 20:15:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ( cft - > flags  &  CFTYPE_INSANE )  & &  cgroup_sane_behavior ( cgrp ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ( cft - > flags  &  CFTYPE_NOT_ON_ROOT )  & &  ! cgroup_parent ( cgrp ) ) 
							 
						 
					
						
							
								
									
										
										
										
											2012-12-06 14:38:57 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ( cft - > flags  &  CFTYPE_ONLY_ON_ROOT )  & &  cgroup_parent ( cgrp ) ) 
							 
						 
					
						
							
								
									
										
										
										
											2012-12-06 14:38:57 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-01-21 18:18:33 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( is_add )  { 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											ret  =  cgroup_add_file ( cgrp ,  cft ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-28 16:24:10 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											if  ( ret )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-25 18:28:03 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												pr_warn ( " %s: failed to add %s, err=%d \n " , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
													__func__ ,  cft - > name ,  ret ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-28 16:24:10 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												return  ret ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											} 
							 
						 
					
						
							
								
									
										
										
										
											2013-01-21 18:18:33 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										}  else  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											cgroup_rm_file ( cgrp ,  cft ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-28 16:24:10 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_apply_cftypes ( struct  cftype  * cfts ,  bool  is_add )  
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									LIST_HEAD ( pending ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys  * ss  =  cfts [ 0 ] . ss ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * root  =  & ss - > root - > cgrp ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:25 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * css ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-28 16:24:11 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  ret  =  0 ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-18 18:48:37 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* add/rm files for all cgroups created before */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-26 18:40:56 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									css_for_each_descendant_pre ( css ,  cgroup_css ( root ,  ss ) )  { 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:25 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										struct  cgroup  * cgrp  =  css - > cgroup ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-18 18:48:37 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( cgroup_is_dead ( cgrp ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										ret  =  cgroup_addrm_files ( cgrp ,  cfts ,  is_add ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-28 16:24:11 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ret ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											break ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( is_add  & &  ! ret ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										kernfs_activate ( root - > kn ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-28 16:24:11 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  ret ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  cgroup_exit_cftypes ( struct  cftype  * cfts )  
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cftype  * cft ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for  ( cft  =  cfts ;  cft - > name [ 0 ]  ! =  ' \0 ' ;  cft + + )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										/* free copy for custom atomic_write_len, see init_cftypes() */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( cft - > max_write_len  & &  cft - > max_write_len  ! =  PAGE_SIZE ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											kfree ( cft - > kf_ops ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cft - > kf_ops  =  NULL ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										cft - > ss  =  NULL ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_init_cftypes ( struct  cgroup_subsys  * ss ,  struct  cftype  * cfts )  
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cftype  * cft ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for  ( cft  =  cfts ;  cft - > name [ 0 ]  ! =  ' \0 ' ;  cft + + )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										struct  kernfs_ops  * kf_ops ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										WARN_ON ( cft - > ss  | |  cft - > kf_ops ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( cft - > seq_start ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											kf_ops  =  & cgroup_kf_ops ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										else 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											kf_ops  =  & cgroup_kf_single_ops ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  Ugh . . .  if  @ cft  wants  a  custom  max_write_len ,  we  need  to 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  make  a  copy  of  kf_ops  to  set  its  atomic_write_len . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( cft - > max_write_len  & &  cft - > max_write_len  ! =  PAGE_SIZE )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											kf_ops  =  kmemdup ( kf_ops ,  sizeof ( * kf_ops ) ,  GFP_KERNEL ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( ! kf_ops )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												cgroup_exit_cftypes ( cfts ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												return  - ENOMEM ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											kf_ops - > atomic_write_len  =  cft - > max_write_len ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										cft - > kf_ops  =  kf_ops ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										cft - > ss  =  ss ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_rm_cftypes_locked ( struct  cftype  * cfts )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ! cfts  | |  ! cfts [ 0 ] . ss ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  - ENOENT ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									list_del ( & cfts - > node ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									cgroup_apply_cftypes ( cfts ,  false ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									cgroup_exit_cftypes ( cfts ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_rm_cftypes  -  remove  an  array  of  cftypes  from  a  subsystem 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ cfts :  zero - length  name  terminated  array  of  cftypes 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  Unregister  @ cfts .   Files  described  by  @ cfts  are  removed  from  all 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  existing  cgroups  and  all  future  cgroups  won ' t  have  them  either .   This 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  function  can  be  called  anytime  whether  @ cfts '  subsys  is  attached  or  not . 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Returns  0  on  successful  unregistration ,  - ENOENT  if  @ cfts  is  not 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  registered . 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								int  cgroup_rm_cftypes ( struct  cftype  * cfts )  
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  ret ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_lock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									ret  =  cgroup_rm_cftypes_locked ( cfts ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_unlock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  ret ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_add_cftypes  -  add  an  array  of  cftypes  to  a  subsystem 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ ss :  target  cgroup  subsystem 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ cfts :  zero - length  name  terminated  array  of  cftypes 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Register  @ cfts  to  @ ss .   Files  described  by  @ cfts  are  created  for  all 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  existing  cgroups  to  which  @ ss  is  attached  and  all  future  cgroups  will 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  have  them  too .   This  function  can  be  called  anytime  whether  @ ss  is 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  attached  or  not . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Returns  0  on  successful  registration ,  - errno  on  failure .   Note  that  this 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  function  currently  returns  0  as  long  as  @ cfts  registration  is  successful 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  even  if  some  file  creation  attempts  on  existing  cgroups  fail . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add xattr support
This is one of the items in the plumber's wish list.
For use cases:
>> What would the use case be for this?
>
> Attaching meta information to services, in an easily discoverable
> way. For example, in systemd we create one cgroup for each service, and
> could then store data like the main pid of the specific service as an
> xattr on the cgroup itself. That way we'd have almost all service state
> in the cgroupfs, which would make it possible to terminate systemd and
> later restart it without losing any state information. But there's more:
> for example, some very peculiar services cannot be terminated on
> shutdown (i.e. fakeraid DM stuff) and it would be really nice if the
> services in question could just mark that on their cgroup, by setting an
> xattr. On the more desktopy side of things there are other
> possibilities: for example there are plans defining what an application
> is along the lines of a cgroup (i.e. an app being a collection of
> processes). With xattrs one could then attach an icon or human readable
> program name on the cgroup.
>
> The key idea is that this would allow attaching runtime meta information
> to cgroups and everything they model (services, apps, vms), that doesn't
> need any complex userspace infrastructure, has good access control
> (i.e. because the file system enforces that anyway, and there's the
> "trusted." xattr namespace), notifications (inotify), and can easily be
> shared among applications.
>
> Lennart
v7:
- no changes
v6:
- remove user xattr namespace, only allow trusted and security
v5:
- check for capabilities before setting/removing xattrs
v4:
- no changes
v3:
- instead of config option, use mount option to enable xattr support
Original-patch-by: Li Zefan <lizefan@huawei.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lennart Poettering <lpoetter@redhat.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
											 
										 
										
											2012-08-23 16:53:30 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								int  cgroup_add_cftypes ( struct  cgroup_subsys  * ss ,  struct  cftype  * cfts )  
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-06-28 16:24:11 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  ret ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-06-05 17:16:30 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ss - > disabled ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-17 10:41:50 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ! cfts  | |  cfts [ 0 ] . name [ 0 ]  = =  ' \0 ' ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  0 ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									ret  =  cgroup_init_cftypes ( ss ,  cfts ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ret ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  ret ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_lock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									list_add_tail ( & cfts - > node ,  & ss - > cfts ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									ret  =  cgroup_apply_cftypes ( cfts ,  true ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-28 16:24:11 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ret ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										cgroup_rm_cftypes_locked ( cfts ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_unlock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-28 16:24:11 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  ret ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2008-02-23 15:24:09 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_task_count  -  count  the  number  of  tasks  in  a  cgroup . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ cgrp :  the  cgroup  in  question 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Return  the  number  of  tasks  in  the  cgroup . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:39 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_task_count ( const  struct  cgroup  * cgrp )  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:32 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									int  count  =  0 ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgrp_cset_link  * link ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									down_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									list_for_each_entry ( link ,  & cgrp - > cset_links ,  cset_link ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										count  + =  atomic_read ( & link - > cset - > refcount ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									up_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:32 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  count ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section.  This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive.  This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.
It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible.  A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling.  IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.
This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next.  This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order.  A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.
This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.
Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.
v2: Typo fix as per Serge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
											 
										 
										
											2013-05-24 10:55:38 +09:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:25 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  css_next_child  -  find  the  next  child  of  a  given  css 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ pos :  the  current  position  ( % NULL  to  initiate  traversal ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ parent :  css  whose  children  to  walk 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section.  This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive.  This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.
It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible.  A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling.  IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.
This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next.  This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order.  A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.
This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.
Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.
v2: Typo fix as per Serge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
											 
										 
										
											2013-05-24 10:55:38 +09:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  This  function  returns  the  next  child  of  @ parent  and  should  be  called 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:55 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  under  either  cgroup_mutex  or  RCU  read  lock .   The  only  requirement  is 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  that  @ parent  and  @ pos  are  accessible .   The  next  sibling  is  guaranteed  to 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  be  returned  regardless  of  their  states . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  If  a  subsystem  synchronizes  - > css_online ( )  and  the  start  of  iteration ,  a 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  css  which  finished  - > css_online ( )  is  guaranteed  to  be  visible  in  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  future  iterations  and  will  stay  visible  until  the  last  reference  is  put . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  A  css  which  hasn ' t  finished  - > css_online ( )  or  already  finished 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  - > css_offline ( )  may  show  up  during  traversal .   It ' s  each  subsystem ' s 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  responsibility  to  synchronize  against  on / offlining . 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section.  This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive.  This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.
It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible.  A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling.  IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.
This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next.  This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order.  A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.
This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.
Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.
v2: Typo fix as per Serge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
											 
										 
										
											2013-05-24 10:55:38 +09:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								struct  cgroup_subsys_state  * css_next_child ( struct  cgroup_subsys_state  * pos ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
													   struct  cgroup_subsys_state  * parent ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section.  This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive.  This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.
It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible.  A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling.  IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.
This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next.  This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order.  A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.
This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.
Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.
v2: Typo fix as per Serge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
											 
										 
										
											2013-05-24 10:55:38 +09:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * next ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section.  This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive.  This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.
It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible.  A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling.  IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.
This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next.  This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order.  A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.
This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.
Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.
v2: Typo fix as per Serge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
											 
										 
										
											2013-05-24 10:55:38 +09:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgroup_assert_mutex_or_rcu_locked ( ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section.  This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive.  This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.
It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible.  A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling.  IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.
This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next.  This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order.  A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.
This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.
Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.
v2: Typo fix as per Serge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
											 
										 
										
											2013-05-24 10:55:38 +09:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:49 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  @ pos  could  already  have  been  unlinked  from  the  sibling  list . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Once  a  cgroup  is  removed ,  its  - > sibling . next  is  no  longer 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  updated  when  its  next  sibling  changes .   CSS_RELEASED  is  set  when 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  @ pos  is  taken  off  list ,  at  which  time  its  next  pointer  is  valid , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  and ,  as  releases  are  serialized ,  the  one  pointed  to  by  the  next 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  pointer  is  guaranteed  to  not  have  started  release  yet .   This 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  implies  that  if  we  observe  ! CSS_RELEASED  on  @ pos  in  this  RCU 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  critical  section ,  the  one  pointed  to  by  its  next  pointer  is 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  guaranteed  to  not  have  finished  its  RCU  grace  period  even  if  we 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  have  dropped  rcu_read_lock ( )  inbetween  iterations . 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:24 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 * 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:49 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  If  @ pos  has  CSS_RELEASED  set ,  its  next  pointer  can ' t  be 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  dereferenced ;  however ,  as  each  css  is  given  a  monotonically 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  increasing  unique  serial  number  and  always  appended  to  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  sibling  list ,  the  next  one  can  be  found  by  walking  the  parent ' s 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  children  until  the  first  css  with  higher  serial  number  than 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  @ pos ' s .   While  this  path  can  be  slower ,  it  happens  iff  iteration 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  races  against  release  and  the  race  window  is  very  small . 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section.  This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive.  This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.
It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible.  A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling.  IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.
This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next.  This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order.  A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.
This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.
Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.
v2: Typo fix as per Serge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
											 
										 
										
											2013-05-24 10:55:38 +09:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:24 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ! pos )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										next  =  list_entry_rcu ( parent - > children . next ,  struct  cgroup_subsys_state ,  sibling ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									}  else  if  ( likely ( ! ( pos - > flags  &  CSS_RELEASED ) ) )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										next  =  list_entry_rcu ( pos - > sibling . next ,  struct  cgroup_subsys_state ,  sibling ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:24 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									}  else  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										list_for_each_entry_rcu ( next ,  & parent - > children ,  sibling ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:24 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											if  ( next - > serial_nr  >  pos - > serial_nr ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												break ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section.  This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive.  This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.
It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible.  A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling.  IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.
This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next.  This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order.  A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.
This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.
Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.
v2: Typo fix as per Serge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
											 
										 
										
											2013-05-24 10:55:38 +09:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  @ next ,  if  not  pointing  to  the  head ,  can  be  dereferenced  and  is 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  the  next  sibling . 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( & next - > sibling  ! =  & parent - > children ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  next ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  NULL ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section.  This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive.  This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.
It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible.  A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling.  IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.
This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next.  This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order.  A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.
This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.
Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.
v2: Typo fix as per Serge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
											 
										 
										
											2013-05-24 10:55:38 +09:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:25 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  css_next_descendant_pre  -  find  the  next  descendant  for  pre - order  walk 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ pos :  the  current  position  ( % NULL  to  initiate  traversal ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:25 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ root :  css  whose  descendants  to  walk 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:25 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  To  be  used  by  css_for_each_descendant_pre ( ) .   Find  the  next  descendant 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:27 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  to  visit  for  pre - order  traversal  of  @ root ' s  descendants .   @ root  is 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  included  in  the  iteration  and  the  first  node  to  be  visited . 
							 
						 
					
						
							
								
									
										
										
										
											2013-05-24 10:55:38 +09:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:55 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  While  this  function  requires  cgroup_mutex  or  RCU  read  locking ,  it 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  doesn ' t  require  the  whole  traversal  to  be  contained  in  a  single  critical 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  section .   This  function  will  return  the  correct  next  descendant  as  long 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  as  both  @ pos  and  @ root  are  accessible  and  @ pos  is  a  descendant  of  @ root . 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  If  a  subsystem  synchronizes  - > css_online ( )  and  the  start  of  iteration ,  a 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  css  which  finished  - > css_online ( )  is  guaranteed  to  be  visible  in  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  future  iterations  and  will  stay  visible  until  the  last  reference  is  put . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  A  css  which  hasn ' t  finished  - > css_online ( )  or  already  finished 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  - > css_offline ( )  may  show  up  during  traversal .   It ' s  each  subsystem ' s 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  responsibility  to  synchronize  against  on / offlining . 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:25 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								struct  cgroup_subsys_state  *  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								css_next_descendant_pre ( struct  cgroup_subsys_state  * pos ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											struct  cgroup_subsys_state  * root ) 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:25 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * next ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgroup_assert_mutex_or_rcu_locked ( ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:27 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* if first iteration, visit @root */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-05-24 10:50:24 +09:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ! pos ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:27 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  root ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* visit the first child if exists */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:25 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									next  =  css_next_child ( NULL ,  pos ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( next ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  next ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* no child, visit my or the closest ancestor's next sibling */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:25 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									while  ( pos  ! =  root )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										next  =  css_next_child ( pos ,  pos - > parent ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-05-24 10:55:38 +09:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( next ) 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											return  next ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										pos  =  pos - > parent ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-05-24 10:50:24 +09:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  NULL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-01-07 08:49:33 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:25 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  css_rightmost_descendant  -  return  the  rightmost  descendant  of  a  css 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ pos :  css  of  interest 
							 
						 
					
						
							
								
									
										
										
										
											2013-01-07 08:49:33 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:25 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  Return  the  rightmost  descendant  of  @ pos .   If  there ' s  no  descendant ,  @ pos 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  is  returned .   This  can  be  used  during  pre - order  traversal  to  skip 
							 
						 
					
						
							
								
									
										
										
										
											2013-01-07 08:49:33 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  subtree  of  @ pos . 
							 
						 
					
						
							
								
									
										
										
										
											2013-05-24 10:55:38 +09:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:55 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  While  this  function  requires  cgroup_mutex  or  RCU  read  locking ,  it 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  doesn ' t  require  the  whole  traversal  to  be  contained  in  a  single  critical 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  section .   This  function  will  return  the  correct  rightmost  descendant  as 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  long  as  @ pos  is  accessible . 
							 
						 
					
						
							
								
									
										
										
										
											2013-01-07 08:49:33 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:25 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								struct  cgroup_subsys_state  *  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								css_rightmost_descendant ( struct  cgroup_subsys_state  * pos )  
						 
					
						
							
								
									
										
										
										
											2013-01-07 08:49:33 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:25 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * last ,  * tmp ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-01-07 08:49:33 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgroup_assert_mutex_or_rcu_locked ( ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-01-07 08:49:33 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									do  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										last  =  pos ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										/* ->prev isn't RCU safe, walk ->next till the end */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										pos  =  NULL ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:25 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										css_for_each_child ( tmp ,  last ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-01-07 08:49:33 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											pos  =  tmp ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									}  while  ( pos ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  last ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:25 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  struct  cgroup_subsys_state  *  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								css_leftmost_descendant ( struct  cgroup_subsys_state  * pos )  
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:25 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * last ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									do  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										last  =  pos ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:25 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										pos  =  css_next_child ( NULL ,  pos ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									}  while  ( pos ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  last ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:25 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  css_next_descendant_post  -  find  the  next  descendant  for  post - order  walk 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ pos :  the  current  position  ( % NULL  to  initiate  traversal ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:25 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ root :  css  whose  descendants  to  walk 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:25 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  To  be  used  by  css_for_each_descendant_post ( ) .   Find  the  next  descendant 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:27 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  to  visit  for  post - order  traversal  of  @ root ' s  descendants .   @ root  is 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  included  in  the  iteration  and  the  last  node  to  be  visited . 
							 
						 
					
						
							
								
									
										
										
										
											2013-05-24 10:55:38 +09:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:55 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  While  this  function  requires  cgroup_mutex  or  RCU  read  locking ,  it 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  doesn ' t  require  the  whole  traversal  to  be  contained  in  a  single  critical 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  section .   This  function  will  return  the  correct  next  descendant  as  long 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  as  both  @ pos  and  @ cgroup  are  accessible  and  @ pos  is  a  descendant  of 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ cgroup . 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  If  a  subsystem  synchronizes  - > css_online ( )  and  the  start  of  iteration ,  a 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  css  which  finished  - > css_online ( )  is  guaranteed  to  be  visible  in  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  future  iterations  and  will  stay  visible  until  the  last  reference  is  put . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  A  css  which  hasn ' t  finished  - > css_online ( )  or  already  finished 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  - > css_offline ( )  may  show  up  during  traversal .   It ' s  each  subsystem ' s 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  responsibility  to  synchronize  against  on / offlining . 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:25 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								struct  cgroup_subsys_state  *  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								css_next_descendant_post ( struct  cgroup_subsys_state  * pos ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 struct  cgroup_subsys_state  * root ) 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:25 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * next ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgroup_assert_mutex_or_rcu_locked ( ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-09-06 15:31:08 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* if first iteration, visit leftmost descendant which may be @root */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ! pos ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  css_leftmost_descendant ( root ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:27 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* if we visited @root, we're done */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( pos  = =  root ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  NULL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* if there's an unvisited sibling, visit its leftmost descendant */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									next  =  css_next_child ( pos ,  pos - > parent ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-05-24 10:55:38 +09:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( next ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:25 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  css_leftmost_descendant ( next ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* no sibling left, visit parent */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  pos - > parent ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:52 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  css_has_online_children  -  does  a  css  have  online  children 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ css :  the  target  css 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Returns  % true  if  @ css  has  any  online  children ;  otherwise ,  % false .   This 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  function  can  be  called  from  any  context  but  the  caller  is  responsible 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  for  synchronizing  against  on / offlining  as  necessary . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								bool  css_has_online_children ( struct  cgroup_subsys_state  * css )  
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:01 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:52 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * child ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									bool  ret  =  false ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:01 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									rcu_read_lock ( ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:52 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									css_for_each_child ( child ,  css )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( css - > flags  &  CSS_ONLINE )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											ret  =  true ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											break ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:01 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									rcu_read_unlock ( ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:52 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  ret ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  css_advance_task_iter  -  advance  a  task  itererator  to  the  next  css_set 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ it :  the  iterator  to  advance 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Advance  @ it  to  the  next  css_set  to  walk . 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  css_advance_task_iter ( struct  css_task_iter  * it )  
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  list_head  * l  =  it - > cset_pos ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgrp_cset_link  * link ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  css_set  * cset ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* Advance to the next non-empty css_set */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									do  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										l  =  l - > next ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( l  = =  it - > cset_head )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											it - > cset_pos  =  NULL ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											return ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( it - > ss )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											cset  =  container_of ( l ,  struct  css_set , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
													    e_cset_node [ it - > ss - > id ] ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										}  else  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											link  =  list_entry ( l ,  struct  cgrp_cset_link ,  cset_link ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											cset  =  link - > cset ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add css_set->mg_tasks
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, we're going to use task->cg_list during
migration too.  Instead of building a separate array, target tasks
will be linked into a dedicated migration list_head on the owning
css_set.  Tasks on the migration list are treated the same as tasks on
the usual tasks list; however, being on a separate list allows cgroup
migration code path to keep track of the target tasks by simply
keeping the list of css_sets with tasks being migrated, making
unpredictable dynamic allocation unnecessary.
In prepartion of such migration path update, this patch introduces
css_set->mg_tasks list and updates css_set task iterations so that
they walk both css_set->tasks and ->mg_tasks.  Note that ->mg_tasks
isn't used yet.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									}  while  ( list_empty ( & cset - > tasks )  & &  list_empty ( & cset - > mg_tasks ) ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									it - > cset_pos  =  l ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add css_set->mg_tasks
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, we're going to use task->cg_list during
migration too.  Instead of building a separate array, target tasks
will be linked into a dedicated migration list_head on the owning
css_set.  Tasks on the migration list are treated the same as tasks on
the usual tasks list; however, being on a separate list allows cgroup
migration code path to keep track of the target tasks by simply
keeping the list of css_sets with tasks being migrated, making
unpredictable dynamic allocation unnecessary.
In prepartion of such migration path update, this patch introduces
css_set->mg_tasks list and updates css_set task iterations so that
they walk both css_set->tasks and ->mg_tasks.  Note that ->mg_tasks
isn't used yet.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ! list_empty ( & cset - > tasks ) ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										it - > task_pos  =  cset - > tasks . next ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add css_set->mg_tasks
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, we're going to use task->cg_list during
migration too.  Instead of building a separate array, target tasks
will be linked into a dedicated migration list_head on the owning
css_set.  Tasks on the migration list are treated the same as tasks on
the usual tasks list; however, being on a separate list allows cgroup
migration code path to keep track of the target tasks by simply
keeping the list of css_sets with tasks being migrated, making
unpredictable dynamic allocation unnecessary.
In prepartion of such migration path update, this patch introduces
css_set->mg_tasks list and updates css_set task iterations so that
they walk both css_set->tasks and ->mg_tasks.  Note that ->mg_tasks
isn't used yet.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									else 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										it - > task_pos  =  cset - > mg_tasks . next ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									it - > tasks_head  =  & cset - > tasks ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									it - > mg_tasks_head  =  & cset - > mg_tasks ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  css_task_iter_start  -  initiate  task  iteration 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ css :  the  css  to  walk  tasks  of 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ it :  the  task  iterator  to  use 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  Initiate  iteration  through  the  tasks  of  @ css .   The  caller  can  call 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  css_task_iter_next ( )  to  walk  through  the  tasks  until  the  function 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  returns  NULL .   On  completion  of  iteration ,  css_task_iter_end ( )  must  be 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  called . 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Note  that  this  function  acquires  a  lock  which  is  released  when  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  iteration  finishes .   The  caller  can ' t  sleep  while  iteration  is  in 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  progress . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								void  css_task_iter_start ( struct  cgroup_subsys_state  * css ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 struct  css_task_iter  * it ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									__acquires ( css_set_rwsem ) 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:38 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* no one should try to iterate before mounting cgroups */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									WARN_ON_ONCE ( ! use_task_css_set_links ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-02-07 00:14:42 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									down_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									it - > ss  =  css - > ss ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( it - > ss ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										it - > cset_pos  =  & css - > cgroup - > e_csets [ css - > ss - > id ] ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									else 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										it - > cset_pos  =  & css - > cgroup - > cset_links ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									it - > cset_head  =  it - > cset_pos ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									css_advance_task_iter ( it ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  css_task_iter_next  -  return  the  next  task  for  the  iterator 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ it :  the  task  iterator  being  iterated 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  The  " next "  function  for  task  iteration .   @ it  should  have  been 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  initialized  via  css_task_iter_start ( ) .   Returns  NULL  when  the  iteration 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  reaches  the  end . 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								struct  task_struct  * css_task_iter_next ( struct  css_task_iter  * it )  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  task_struct  * res ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  list_head  * l  =  it - > task_pos ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* If the iterator cg is NULL, we have no tasks */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ! it - > cset_pos ) 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  NULL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									res  =  list_entry ( l ,  struct  task_struct ,  cg_list ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add css_set->mg_tasks
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, we're going to use task->cg_list during
migration too.  Instead of building a separate array, target tasks
will be linked into a dedicated migration list_head on the owning
css_set.  Tasks on the migration list are treated the same as tasks on
the usual tasks list; however, being on a separate list allows cgroup
migration code path to keep track of the target tasks by simply
keeping the list of css_sets with tasks being migrated, making
unpredictable dynamic allocation unnecessary.
In prepartion of such migration path update, this patch introduces
css_set->mg_tasks list and updates css_set task iterations so that
they walk both css_set->tasks and ->mg_tasks.  Note that ->mg_tasks
isn't used yet.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Advance  iterator  to  find  next  entry .   cset - > tasks  is  consumed 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  first  and  then  - > mg_tasks .   After  - > mg_tasks ,  we  move  onto  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  next  cset . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									l  =  l - > next ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add css_set->mg_tasks
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, we're going to use task->cg_list during
migration too.  Instead of building a separate array, target tasks
will be linked into a dedicated migration list_head on the owning
css_set.  Tasks on the migration list are treated the same as tasks on
the usual tasks list; however, being on a separate list allows cgroup
migration code path to keep track of the target tasks by simply
keeping the list of css_sets with tasks being migrated, making
unpredictable dynamic allocation unnecessary.
In prepartion of such migration path update, this patch introduces
css_set->mg_tasks list and updates css_set task iterations so that
they walk both css_set->tasks and ->mg_tasks.  Note that ->mg_tasks
isn't used yet.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( l  = =  it - > tasks_head ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										l  =  it - > mg_tasks_head - > next ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add css_set->mg_tasks
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, we're going to use task->cg_list during
migration too.  Instead of building a separate array, target tasks
will be linked into a dedicated migration list_head on the owning
css_set.  Tasks on the migration list are treated the same as tasks on
the usual tasks list; however, being on a separate list allows cgroup
migration code path to keep track of the target tasks by simply
keeping the list of css_sets with tasks being migrated, making
unpredictable dynamic allocation unnecessary.
In prepartion of such migration path update, this patch introduces
css_set->mg_tasks list and updates css_set task iterations so that
they walk both css_set->tasks and ->mg_tasks.  Note that ->mg_tasks
isn't used yet.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( l  = =  it - > mg_tasks_head ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										css_advance_task_iter ( it ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add css_set->mg_tasks
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, we're going to use task->cg_list during
migration too.  Instead of building a separate array, target tasks
will be linked into a dedicated migration list_head on the owning
css_set.  Tasks on the migration list are treated the same as tasks on
the usual tasks list; however, being on a separate list allows cgroup
migration code path to keep track of the target tasks by simply
keeping the list of css_sets with tasks being migrated, making
unpredictable dynamic allocation unnecessary.
In prepartion of such migration path update, this patch introduces
css_set->mg_tasks list and updates css_set task iterations so that
they walk both css_set->tasks and ->mg_tasks.  Note that ->mg_tasks
isn't used yet.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									else 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										it - > task_pos  =  l ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add css_set->mg_tasks
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, we're going to use task->cg_list during
migration too.  Instead of building a separate array, target tasks
will be linked into a dedicated migration list_head on the owning
css_set.  Tasks on the migration list are treated the same as tasks on
the usual tasks list; however, being on a separate list allows cgroup
migration code path to keep track of the target tasks by simply
keeping the list of css_sets with tasks being migrated, making
unpredictable dynamic allocation unnecessary.
In prepartion of such migration path update, this patch introduces
css_set->mg_tasks list and updates css_set task iterations so that
they walk both css_set->tasks and ->mg_tasks.  Note that ->mg_tasks
isn't used yet.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  res ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  css_task_iter_end  -  finish  task  iteration 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ it :  the  task  iterator  to  finish 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  Finish  task  iteration  started  by  css_task_iter_start ( ) . 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								void  css_task_iter_end ( struct  css_task_iter  * it )  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									__releases ( css_set_rwsem ) 
							 
						 
					
						
							
								
									
										
										
										
											2008-02-07 00:14:42 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									up_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-02-07 00:14:42 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
									
										
										
										
											2013-04-07 09:29:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  cgroup_trasnsfer_tasks  -  move  tasks  from  one  cgroup  to  another 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ to :  cgroup  to  which  the  tasks  will  be  moved 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ from :  cgroup  in  which  the  tasks  currently  reside 
							 
						 
					
						
							
								
									
										
										
										
											2008-02-07 00:14:42 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  Locking  rules  between  cgroup_post_fork ( )  and  the  migration  path 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  guarantee  that ,  if  a  task  is  forking  while  being  migrated ,  the  new  child 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  is  guaranteed  to  be  either  visible  in  the  source  cgroup  after  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  parent ' s  migration  is  complete  or  put  into  the  target  cgroup .   No  task 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  can  slip  out  of  migration  through  forking . 
							 
						 
					
						
							
								
									
										
										
										
											2008-02-07 00:14:42 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-07 09:29:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								int  cgroup_transfer_tasks ( struct  cgroup  * to ,  struct  cgroup  * from )  
						 
					
						
							
								
									
										
										
										
											2008-02-07 00:14:42 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									LIST_HEAD ( preloaded_csets ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgrp_cset_link  * link ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  css_task_iter  it ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:39 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  task_struct  * task ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  ret ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-02-07 00:14:42 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_lock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-02-07 00:14:42 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* all tasks in @from are being moved, all csets are source */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									down_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									list_for_each_entry ( link ,  & from - > cset_links ,  cset_link ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgroup_migrate_add_src ( link - > cset ,  to ,  & preloaded_csets ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									up_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-02-07 00:14:42 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									ret  =  cgroup_migrate_prepare_dst ( to ,  & preloaded_csets ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ret ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										goto  out_err ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-07 09:29:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Migrate  tasks  one - by - one  until  @ form  is  empty .   This  fails  iff 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  - > can_attach ( )  fails . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:39 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									do  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:00 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										css_task_iter_start ( & from - > self ,  & it ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:39 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										task  =  css_task_iter_next ( & it ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( task ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											get_task_struct ( task ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										css_task_iter_end ( & it ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( task )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											ret  =  cgroup_migrate ( to ,  task ,  false ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:39 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											put_task_struct ( task ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									}  while  ( task  & &  ! ret ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								out_err :  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									cgroup_migrate_finish ( & preloaded_csets ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-07 09:29:51 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_unlock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:39 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  ret ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-07 09:29:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:32 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  Stuff  for  reading  the  ' tasks ' / ' procs '  files . 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:32 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Reading  this  file  can  return  large  amounts  of  data  if  a  cgroup  has 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  * lots *  of  attached  tasks .  So  it  may  need  several  calls  to  read ( ) , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  but  we  cannot  guarantee  that  the  information  we  produce  is  correct 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  unless  we  produce  it  entirely  atomically . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2012-01-20 11:58:43 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/* which pidlist file are we talking about? */  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								enum  cgroup_filetype  {  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									CGROUP_FILE_PROCS , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									CGROUP_FILE_TASKS , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								} ;  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  A  pidlist  is  a  list  of  pids  that  virtually  represents  the  contents  of  one 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  of  the  cgroup  files  ( " procs "  or  " tasks " ) .  We  keep  a  list  of  such  pidlists , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  a  pair  ( one  each  for  procs ,  tasks )  for  each  pid  namespace  that ' s  relevant 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  to  the  cgroup . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								struct  cgroup_pidlist  {  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  used  to  find  which  pidlist  is  wanted .  doesn ' t  change  as  long  as 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  this  particular  list  stays  in  the  list . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									*/ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  {  enum  cgroup_filetype  type ;  struct  pid_namespace  * ns ;  }  key ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* array of xids */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									pid_t  * list ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* how many elements the above list has */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									int  length ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* each of these stored in a list by its cgroup */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  list_head  links ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* pointer to the cgroup we belong to, for list removal purposes */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgroup  * owner ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* for delayed destruction */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  delayed_work  destroy_dwork ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-01-20 11:58:43 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								} ;  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:28 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  The  following  two  functions  " fix "  the  issue  where  there  are  more  pids 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  than  kmalloc  will  give  memory  for ;  in  such  cases ,  we  use  vmalloc / vfree . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  TODO :  replace  with  a  kernel - wide  solution  to  this  problem 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# define PIDLIST_TOO_LARGE(c) ((c) * sizeof(pid_t) > (PAGE_SIZE * 2)) 
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  void  * pidlist_allocate ( int  count )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( PIDLIST_TOO_LARGE ( count ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  vmalloc ( count  *  sizeof ( pid_t ) ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									else 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  kmalloc ( count  *  sizeof ( pid_t ) ,  GFP_KERNEL ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:28 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  pidlist_free ( void  * p )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( is_vmalloc_addr ( p ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										vfree ( p ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									else 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										kfree ( p ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Used  to  destroy  all  pidlists  lingering  waiting  for  destroy  timer .   None 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  should  be  left  afterwards . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  void  cgroup_pidlist_destroy_all ( struct  cgroup  * cgrp )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgroup_pidlist  * l ,  * tmp_l ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									mutex_lock ( & cgrp - > pidlist_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									list_for_each_entry_safe ( l ,  tmp_l ,  & cgrp - > pidlists ,  links ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										mod_delayed_work ( cgroup_pidlist_destroy_wq ,  & l - > destroy_dwork ,  0 ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									mutex_unlock ( & cgrp - > pidlist_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									flush_workqueue ( cgroup_pidlist_destroy_wq ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									BUG_ON ( ! list_empty ( & cgrp - > pidlists ) ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  void  cgroup_pidlist_destroy_work_fn ( struct  work_struct  * work )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  delayed_work  * dwork  =  to_delayed_work ( work ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgroup_pidlist  * l  =  container_of ( dwork ,  struct  cgroup_pidlist , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
														destroy_dwork ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgroup_pidlist  * tofree  =  NULL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									mutex_lock ( & l - > owner - > pidlist_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  Destroy  iff  we  didn ' t  get  queued  again .   The  state  won ' t  change 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  as  destroy_dwork  can  only  be  queued  while  locked . 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ! delayed_work_pending ( dwork ) )  { 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										list_del ( & l - > links ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										pidlist_free ( l - > list ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										put_pid_ns ( l - > key . ns ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										tofree  =  l ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									mutex_unlock ( & l - > owner - > pidlist_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									kfree ( tofree ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:32 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  pidlist_uniq  -  given  a  kmalloc ( ) ed  list ,  strip  out  all  duplicate  entries 
							 
						 
					
						
							
								
									
										
										
										
											2013-03-12 15:36:00 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  Returns  the  number  of  unique  elements . 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:32 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-03-12 15:36:00 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  pidlist_uniq ( pid_t  * list ,  int  length )  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:32 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  src ,  dest  =  1 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  we  presume  the  0 th  element  is  unique ,  so  i  starts  at  1.  trivial 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  edge  cases  first ;  no  work  needs  to  be  done  for  either 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( length  = =  0  | |  length  = =  1 ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  length ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* src and dest walk down the list; dest counts unique elements */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									for  ( src  =  1 ;  src  <  length ;  src + + )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										/* find next unique element */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										while  ( list [ src ]  = =  list [ src - 1 ] )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											src + + ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( src  = =  length ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												goto  after ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										/* dest always points to where the next unique element goes */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										list [ dest ]  =  list [ src ] ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										dest + + ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								after :  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  dest ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  The  two  pid  files  -  task  and  cgroup . procs  -  guaranteed  that  the  result 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  is  sorted ,  which  forced  this  whole  pidlist  fiasco .   As  pid  order  is 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  different  per  namespace ,  each  namespace  needs  differently  sorted  list , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  making  it  impossible  to  use ,  for  example ,  single  rbtree  of  member  tasks 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  sorted  by  task  pointer .   As  pidlists  can  be  fairly  large ,  allocating  one 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  per  open  file  is  dangerous ,  so  cgroup  had  to  implement  shared  pool  of 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  pidlists  keyed  by  cgroup  and  namespace . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  All  this  extra  complexity  was  caused  by  the  original  implementation 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  committing  to  an  entirely  unnecessary  property .   In  the  long  term ,  we 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  want  to  do  away  with  it .   Explicitly  scramble  sort  order  if 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  sane_behavior  so  that  no  such  expectation  exists  in  the  new  interface . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Scrambling  is  done  by  swapping  every  two  consecutive  bits ,  which  is 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  non - identity  one - to - one  mapping  which  disturbs  sort  order  sufficiently . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  pid_t  pid_fry ( pid_t  pid )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									unsigned  a  =  pid  &  0x55555555 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									unsigned  b  =  pid  &  0xAAAAAAAA ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  ( a  < <  1 )  |  ( b  > >  1 ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  pid_t  cgroup_pid_fry ( struct  cgroup  * cgrp ,  pid_t  pid )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( cgroup_sane_behavior ( cgrp ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  pid_fry ( pid ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									else 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  pid ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cmppid ( const  void  * a ,  const  void  * b )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  * ( pid_t  * ) a  -  * ( pid_t  * ) b ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  fried_cmppid ( const  void  * a ,  const  void  * b )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  pid_fry ( * ( pid_t  * ) a )  -  pid_fry ( * ( pid_t  * ) b ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  struct  cgroup_pidlist  * cgroup_pidlist_find ( struct  cgroup  * cgrp ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
														  enum  cgroup_filetype  type ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgroup_pidlist  * l ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* don't need task_nsproxy() if we're looking at ourself */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  pid_namespace  * ns  =  task_active_pid_ns ( current ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgrp - > pidlist_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									list_for_each_entry ( l ,  & cgrp - > pidlists ,  links ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( l - > key . type  = =  type  & &  l - > key . ns  = =  ns ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											return  l ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  NULL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:27 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  find  the  appropriate  pidlist  for  our  purpose  ( given  procs  vs  tasks ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  returns  with  the  lock  on  that  pidlist  already  held ,  and  takes  care 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  of  the  use  count ,  or  returns  NULL  with  no  locks  held  if  we ' re  out  of 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  memory . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  struct  cgroup_pidlist  * cgroup_pidlist_find_create ( struct  cgroup  * cgrp ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
														enum  cgroup_filetype  type ) 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:27 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgroup_pidlist  * l ; 
							 
						 
					
						
							
								
									
										
										
										
											2010-03-10 15:22:12 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgrp - > pidlist_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									l  =  cgroup_pidlist_find ( cgrp ,  type ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( l ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  l ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:27 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* entry not found; create a new one */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:51 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									l  =  kzalloc ( sizeof ( struct  cgroup_pidlist ) ,  GFP_KERNEL ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ! l ) 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:27 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  l ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									INIT_DELAYED_WORK ( & l - > destroy_dwork ,  cgroup_pidlist_destroy_work_fn ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:27 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									l - > key . type  =  type ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* don't need task_nsproxy() if we're looking at ourself */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									l - > key . ns  =  get_pid_ns ( task_active_pid_ns ( current ) ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:27 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									l - > owner  =  cgrp ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									list_add ( & l - > links ,  & cgrp - > pidlists ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  l ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Load  a  cgroup ' s  pidarray  with  either  procs '  tgids  or  tasks '  pids 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:27 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  pidlist_array_load ( struct  cgroup  * cgrp ,  enum  cgroup_filetype  type ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											      struct  cgroup_pidlist  * * lp ) 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									pid_t  * array ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									int  length ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									int  pid ,  n  =  0 ;  /* used for populating the array */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  css_task_iter  it ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  task_struct  * tsk ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_pidlist  * l ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgrp - > pidlist_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  If  cgroup  gets  more  users  after  we  read  count ,  we  won ' t  have 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  enough  space  -  tough .   This  race  is  indistinguishable  to  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  caller  from  the  case  that  the  additional  cgroup  users  didn ' t 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  show  up  until  sometime  later  on . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									length  =  cgroup_task_count ( cgrp ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:28 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									array  =  pidlist_allocate ( length ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ! array ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  - ENOMEM ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* now, populate the array */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:00 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									css_task_iter_start ( & cgrp - > self ,  & it ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									while  ( ( tsk  =  css_task_iter_next ( & it ) ) )  { 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( unlikely ( n  = =  length ) ) 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											break ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										/* get tgid or pid for procs or tasks file respectively */ 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:27 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( type  = =  CGROUP_FILE_PROCS ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											pid  =  task_tgid_vnr ( tsk ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										else 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											pid  =  task_pid_vnr ( tsk ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( pid  >  0 )  /* make sure to only use valid results */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											array [ n + + ]  =  pid ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									css_task_iter_end ( & it ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									length  =  n ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* now sort & (if procs) strip out duplicates */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( cgroup_sane_behavior ( cgrp ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										sort ( array ,  length ,  sizeof ( pid_t ) ,  fried_cmppid ,  NULL ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									else 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										sort ( array ,  length ,  sizeof ( pid_t ) ,  cmppid ,  NULL ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:27 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( type  = =  CGROUP_FILE_PROCS ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-03-12 15:36:00 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										length  =  pidlist_uniq ( array ,  length ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									l  =  cgroup_pidlist_find_create ( cgrp ,  type ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:27 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ! l )  { 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										mutex_unlock ( & cgrp - > pidlist_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:28 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										pidlist_free ( array ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:27 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  - ENOMEM ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* store array, freeing old if necessary */ 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:28 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									pidlist_free ( l - > list ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									l - > list  =  array ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									l - > length  =  length ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:27 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									* lp  =  l ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:32 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												Add cgroupstats
This patch is inspired by the discussion at
http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics
as suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.  The
patch is on top of 2.6.21-mm1 with Paul's cgroups v9 patches (forward
ported)
This patch implements per cgroup statistics infrastructure and re-uses
code from the taskstats interface.  A new set of cgroup operations are
registered with commands and attributes.  It should be very easy to
*extend* per cgroup statistics, by adding members to the cgroupstats
structure.
The current model for cgroupstats is a pull, a push model (to post
statistics on interesting events), should be very easy to add.  Currently
user space requests for statistics by passing the cgroup file
descriptor.  Statistics about the state of all the tasks in the cgroup
is returned to user space.
TODO's/NOTE:
This patch provides an infrastructure for implementing cgroup statistics.
Based on the needs of each controller, we can incrementally add more statistics,
event based support for notification of statistics, accumulation of taskstats
into cgroup statistics in the future.
Sample output
# ./cgroupstats -C /cgroup/a
sleeping 2, blocked 0, running 1, stopped 0, uninterruptible 0
# ./cgroupstats -C /cgroup/
sleeping 154, blocked 0, running 0, stopped 0, uninterruptible 0
If the approach looks good, I'll enhance and post the user space utility for
the same
Feedback, comments, test results are always welcome!
[akpm@linux-foundation.org: build fix]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
									
										
										
										
											2008-02-23 15:24:09 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  cgroupstats_build  -  build  and  fill  cgroupstats 
							 
						 
					
						
							
								
									
										
											 
										
											
												Add cgroupstats
This patch is inspired by the discussion at
http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics
as suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.  The
patch is on top of 2.6.21-mm1 with Paul's cgroups v9 patches (forward
ported)
This patch implements per cgroup statistics infrastructure and re-uses
code from the taskstats interface.  A new set of cgroup operations are
registered with commands and attributes.  It should be very easy to
*extend* per cgroup statistics, by adding members to the cgroupstats
structure.
The current model for cgroupstats is a pull, a push model (to post
statistics on interesting events), should be very easy to add.  Currently
user space requests for statistics by passing the cgroup file
descriptor.  Statistics about the state of all the tasks in the cgroup
is returned to user space.
TODO's/NOTE:
This patch provides an infrastructure for implementing cgroup statistics.
Based on the needs of each controller, we can incrementally add more statistics,
event based support for notification of statistics, accumulation of taskstats
into cgroup statistics in the future.
Sample output
# ./cgroupstats -C /cgroup/a
sleeping 2, blocked 0, running 1, stopped 0, uninterruptible 0
# ./cgroupstats -C /cgroup/
sleeping 154, blocked 0, running 0, stopped 0, uninterruptible 0
If the approach looks good, I'll enhance and post the user space utility for
the same
Feedback, comments, test results are always welcome!
[akpm@linux-foundation.org: build fix]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ stats :  cgroupstats  to  fill  information  into 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ dentry :  A  dentry  entry  belonging  to  the  cgroup  for  which  stats  have 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  been  requested . 
							 
						 
					
						
							
								
									
										
										
										
											2008-02-23 15:24:09 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Build  and  fill  cgroupstats  so  that  taskstats  can  export  it  to  user 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  space . 
							 
						 
					
						
							
								
									
										
											 
										
											
												Add cgroupstats
This patch is inspired by the discussion at
http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics
as suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.  The
patch is on top of 2.6.21-mm1 with Paul's cgroups v9 patches (forward
ported)
This patch implements per cgroup statistics infrastructure and re-uses
code from the taskstats interface.  A new set of cgroup operations are
registered with commands and attributes.  It should be very easy to
*extend* per cgroup statistics, by adding members to the cgroupstats
structure.
The current model for cgroupstats is a pull, a push model (to post
statistics on interesting events), should be very easy to add.  Currently
user space requests for statistics by passing the cgroup file
descriptor.  Statistics about the state of all the tasks in the cgroup
is returned to user space.
TODO's/NOTE:
This patch provides an infrastructure for implementing cgroup statistics.
Based on the needs of each controller, we can incrementally add more statistics,
event based support for notification of statistics, accumulation of taskstats
into cgroup statistics in the future.
Sample output
# ./cgroupstats -C /cgroup/a
sleeping 2, blocked 0, running 1, stopped 0, uninterruptible 0
# ./cgroupstats -C /cgroup/
sleeping 154, blocked 0, running 0, stopped 0, uninterruptible 0
If the approach looks good, I'll enhance and post the user space utility for
the same
Feedback, comments, test results are always welcome!
[akpm@linux-foundation.org: build fix]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								int  cgroupstats_build ( struct  cgroupstats  * stats ,  struct  dentry  * dentry )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  kernfs_node  * kn  =  kernfs_node_from_dentry ( dentry ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:40:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * cgrp ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  css_task_iter  it ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Add cgroupstats
This patch is inspired by the discussion at
http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics
as suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.  The
patch is on top of 2.6.21-mm1 with Paul's cgroups v9 patches (forward
ported)
This patch implements per cgroup statistics infrastructure and re-uses
code from the taskstats interface.  A new set of cgroup operations are
registered with commands and attributes.  It should be very easy to
*extend* per cgroup statistics, by adding members to the cgroupstats
structure.
The current model for cgroupstats is a pull, a push model (to post
statistics on interesting events), should be very easy to add.  Currently
user space requests for statistics by passing the cgroup file
descriptor.  Statistics about the state of all the tasks in the cgroup
is returned to user space.
TODO's/NOTE:
This patch provides an infrastructure for implementing cgroup statistics.
Based on the needs of each controller, we can incrementally add more statistics,
event based support for notification of statistics, accumulation of taskstats
into cgroup statistics in the future.
Sample output
# ./cgroupstats -C /cgroup/a
sleeping 2, blocked 0, running 1, stopped 0, uninterruptible 0
# ./cgroupstats -C /cgroup/
sleeping 154, blocked 0, running 0, stopped 0, uninterruptible 0
If the approach looks good, I'll enhance and post the user space utility for
the same
Feedback, comments, test results are always welcome!
[akpm@linux-foundation.org: build fix]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  task_struct  * tsk ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-11-19 15:36:48 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* it should be kernfs_node belonging to cgroupfs and is a directory */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( dentry - > d_sb - > s_type  ! =  & cgroup_fs_type  | |  ! kn  | | 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									    kernfs_type ( kn )  ! =  KERNFS_DIR ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  - EINVAL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-14 16:54:28 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_lock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												Add cgroupstats
This patch is inspired by the discussion at
http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics
as suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.  The
patch is on top of 2.6.21-mm1 with Paul's cgroups v9 patches (forward
ported)
This patch implements per cgroup statistics infrastructure and re-uses
code from the taskstats interface.  A new set of cgroup operations are
registered with commands and attributes.  It should be very easy to
*extend* per cgroup statistics, by adding members to the cgroupstats
structure.
The current model for cgroupstats is a pull, a push model (to post
statistics on interesting events), should be very easy to add.  Currently
user space requests for statistics by passing the cgroup file
descriptor.  Statistics about the state of all the tasks in the cgroup
is returned to user space.
TODO's/NOTE:
This patch provides an infrastructure for implementing cgroup statistics.
Based on the needs of each controller, we can incrementally add more statistics,
event based support for notification of statistics, accumulation of taskstats
into cgroup statistics in the future.
Sample output
# ./cgroupstats -C /cgroup/a
sleeping 2, blocked 0, running 1, stopped 0, uninterruptible 0
# ./cgroupstats -C /cgroup/
sleeping 154, blocked 0, running 0, stopped 0, uninterruptible 0
If the approach looks good, I'll enhance and post the user space utility for
the same
Feedback, comments, test results are always welcome!
[akpm@linux-foundation.org: build fix]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  We  aren ' t  being  called  from  kernfs  and  there ' s  no  guarantee  on 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:11:01 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  @ kn - > priv ' s  validity .   For  this  and  css_tryget_online_from_dir ( ) , 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  @ kn - > priv  is  RCU  safe .   Let ' s  do  the  RCU  dancing . 
							 
						 
					
						
							
								
									
										
											 
										
											
												Add cgroupstats
This patch is inspired by the discussion at
http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics
as suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.  The
patch is on top of 2.6.21-mm1 with Paul's cgroups v9 patches (forward
ported)
This patch implements per cgroup statistics infrastructure and re-uses
code from the taskstats interface.  A new set of cgroup operations are
registered with commands and attributes.  It should be very easy to
*extend* per cgroup statistics, by adding members to the cgroupstats
structure.
The current model for cgroupstats is a pull, a push model (to post
statistics on interesting events), should be very easy to add.  Currently
user space requests for statistics by passing the cgroup file
descriptor.  Statistics about the state of all the tasks in the cgroup
is returned to user space.
TODO's/NOTE:
This patch provides an infrastructure for implementing cgroup statistics.
Based on the needs of each controller, we can incrementally add more statistics,
event based support for notification of statistics, accumulation of taskstats
into cgroup statistics in the future.
Sample output
# ./cgroupstats -C /cgroup/a
sleeping 2, blocked 0, running 1, stopped 0, uninterruptible 0
# ./cgroupstats -C /cgroup/
sleeping 154, blocked 0, running 0, stopped 0, uninterruptible 0
If the approach looks good, I'll enhance and post the user space utility for
the same
Feedback, comments, test results are always welcome!
[akpm@linux-foundation.org: build fix]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									rcu_read_lock ( ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									cgrp  =  rcu_dereference ( kn - > priv ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-14 16:54:28 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ! cgrp  | |  cgroup_is_dead ( cgrp ) )  { 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										rcu_read_unlock ( ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-14 16:54:28 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										mutex_unlock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  - ENOENT ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-14 16:54:28 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									rcu_read_unlock ( ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Add cgroupstats
This patch is inspired by the discussion at
http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics
as suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.  The
patch is on top of 2.6.21-mm1 with Paul's cgroups v9 patches (forward
ported)
This patch implements per cgroup statistics infrastructure and re-uses
code from the taskstats interface.  A new set of cgroup operations are
registered with commands and attributes.  It should be very easy to
*extend* per cgroup statistics, by adding members to the cgroupstats
structure.
The current model for cgroupstats is a pull, a push model (to post
statistics on interesting events), should be very easy to add.  Currently
user space requests for statistics by passing the cgroup file
descriptor.  Statistics about the state of all the tasks in the cgroup
is returned to user space.
TODO's/NOTE:
This patch provides an infrastructure for implementing cgroup statistics.
Based on the needs of each controller, we can incrementally add more statistics,
event based support for notification of statistics, accumulation of taskstats
into cgroup statistics in the future.
Sample output
# ./cgroupstats -C /cgroup/a
sleeping 2, blocked 0, running 1, stopped 0, uninterruptible 0
# ./cgroupstats -C /cgroup/
sleeping 154, blocked 0, running 0, stopped 0, uninterruptible 0
If the approach looks good, I'll enhance and post the user space utility for
the same
Feedback, comments, test results are always welcome!
[akpm@linux-foundation.org: build fix]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:00 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									css_task_iter_start ( & cgrp - > self ,  & it ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									while  ( ( tsk  =  css_task_iter_next ( & it ) ) )  { 
							 
						 
					
						
							
								
									
										
											 
										
											
												Add cgroupstats
This patch is inspired by the discussion at
http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics
as suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.  The
patch is on top of 2.6.21-mm1 with Paul's cgroups v9 patches (forward
ported)
This patch implements per cgroup statistics infrastructure and re-uses
code from the taskstats interface.  A new set of cgroup operations are
registered with commands and attributes.  It should be very easy to
*extend* per cgroup statistics, by adding members to the cgroupstats
structure.
The current model for cgroupstats is a pull, a push model (to post
statistics on interesting events), should be very easy to add.  Currently
user space requests for statistics by passing the cgroup file
descriptor.  Statistics about the state of all the tasks in the cgroup
is returned to user space.
TODO's/NOTE:
This patch provides an infrastructure for implementing cgroup statistics.
Based on the needs of each controller, we can incrementally add more statistics,
event based support for notification of statistics, accumulation of taskstats
into cgroup statistics in the future.
Sample output
# ./cgroupstats -C /cgroup/a
sleeping 2, blocked 0, running 1, stopped 0, uninterruptible 0
# ./cgroupstats -C /cgroup/
sleeping 154, blocked 0, running 0, stopped 0, uninterruptible 0
If the approach looks good, I'll enhance and post the user space utility for
the same
Feedback, comments, test results are always welcome!
[akpm@linux-foundation.org: build fix]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										switch  ( tsk - > state )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										case  TASK_RUNNING : 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											stats - > nr_running + + ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											break ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										case  TASK_INTERRUPTIBLE : 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											stats - > nr_sleeping + + ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											break ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										case  TASK_UNINTERRUPTIBLE : 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											stats - > nr_uninterruptible + + ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											break ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										case  TASK_STOPPED : 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											stats - > nr_stopped + + ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											break ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										default : 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( delayacct_is_task_waiting_on_io ( tsk ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												stats - > nr_io_wait + + ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											break ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									css_task_iter_end ( & it ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Add cgroupstats
This patch is inspired by the discussion at
http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics
as suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.  The
patch is on top of 2.6.21-mm1 with Paul's cgroups v9 patches (forward
ported)
This patch implements per cgroup statistics infrastructure and re-uses
code from the taskstats interface.  A new set of cgroup operations are
registered with commands and attributes.  It should be very easy to
*extend* per cgroup statistics, by adding members to the cgroupstats
structure.
The current model for cgroupstats is a pull, a push model (to post
statistics on interesting events), should be very easy to add.  Currently
user space requests for statistics by passing the cgroup file
descriptor.  Statistics about the state of all the tasks in the cgroup
is returned to user space.
TODO's/NOTE:
This patch provides an infrastructure for implementing cgroup statistics.
Based on the needs of each controller, we can incrementally add more statistics,
event based support for notification of statistics, accumulation of taskstats
into cgroup statistics in the future.
Sample output
# ./cgroupstats -C /cgroup/a
sleeping 2, blocked 0, running 1, stopped 0, uninterruptible 0
# ./cgroupstats -C /cgroup/
sleeping 154, blocked 0, running 0, stopped 0, uninterruptible 0
If the approach looks good, I'll enhance and post the user space utility for
the same
Feedback, comments, test results are always welcome!
[akpm@linux-foundation.org: build fix]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-14 16:54:28 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_unlock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Add cgroupstats
This patch is inspired by the discussion at
http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics
as suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.  The
patch is on top of 2.6.21-mm1 with Paul's cgroups v9 patches (forward
ported)
This patch implements per cgroup statistics infrastructure and re-uses
code from the taskstats interface.  A new set of cgroup operations are
registered with commands and attributes.  It should be very easy to
*extend* per cgroup statistics, by adding members to the cgroupstats
structure.
The current model for cgroupstats is a pull, a push model (to post
statistics on interesting events), should be very easy to add.  Currently
user space requests for statistics by passing the cgroup file
descriptor.  Statistics about the state of all the tasks in the cgroup
is returned to user space.
TODO's/NOTE:
This patch provides an infrastructure for implementing cgroup statistics.
Based on the needs of each controller, we can incrementally add more statistics,
event based support for notification of statistics, accumulation of taskstats
into cgroup statistics in the future.
Sample output
# ./cgroupstats -C /cgroup/a
sleeping 2, blocked 0, running 1, stopped 0, uninterruptible 0
# ./cgroupstats -C /cgroup/
sleeping 154, blocked 0, running 0, stopped 0, uninterruptible 0
If the approach looks good, I'll enhance and post the user space utility for
the same
Feedback, comments, test results are always welcome!
[akpm@linux-foundation.org: build fix]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:25 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:32 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  seq_file  methods  for  the  tasks / procs  files .  The  seq_file  position  is  the 
							 
						 
					
						
							
								
									
										
										
										
											2008-10-18 20:28:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  next  pid  to  display ;  the  seq_file  iterator  is  a  pointer  to  the  pid 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  in  the  cgroup - > l - > list  array . 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:32 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2008-10-18 20:28:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  * cgroup_pidlist_start ( struct  seq_file  * s ,  loff_t  * pos )  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:32 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2008-10-18 20:28:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Initially  we  receive  a  position  value  that  corresponds  to 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  one  more  than  the  last  pid  shown  ( or  0  on  the  first  call  or 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  after  a  seek  to  the  start ) .  Use  a  binary - search  to  find  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  next  pid  to  display ,  if  any 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  kernfs_open_file  * of  =  s - > private ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * cgrp  =  seq_css ( s ) - > cgroup ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_pidlist  * l ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									enum  cgroup_filetype  type  =  seq_cft ( s ) - > private ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-10-18 20:28:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  index  =  0 ,  pid  =  * pos ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  * iter ,  ret ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									mutex_lock ( & cgrp - > pidlist_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  ! NULL  @ of - > priv  indicates  that  this  isn ' t  the  first  start ( ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  after  open .   If  the  matching  pidlist  is  around ,  we  can  use  that . 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  Look  for  it .   Note  that  @ of - > priv  can ' t  be  used  directly .   It 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  could  already  have  been  destroyed . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( of - > priv ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										of - > priv  =  cgroup_pidlist_find ( cgrp ,  type ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Either  this  is  the  first  start ( )  after  open  or  the  matching 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  pidlist  has  been  destroyed  inbetween .   Create  a  new  one . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ! of - > priv )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										ret  =  pidlist_array_load ( cgrp ,  type , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
													 ( struct  cgroup_pidlist  * * ) & of - > priv ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ret ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											return  ERR_PTR ( ret ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									l  =  of - > priv ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-10-18 20:28:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( pid )  { 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										int  end  =  l - > length ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-10-21 16:11:20 +11:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2008-10-18 20:28:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										while  ( index  <  end )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											int  mid  =  ( index  +  end )  /  2 ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											if  ( cgroup_pid_fry ( cgrp ,  l - > list [ mid ] )  = =  pid )  { 
							 
						 
					
						
							
								
									
										
										
										
											2008-10-18 20:28:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												index  =  mid ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												break ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											}  else  if  ( cgroup_pid_fry ( cgrp ,  l - > list [ mid ] )  < =  pid ) 
							 
						 
					
						
							
								
									
										
										
										
											2008-10-18 20:28:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												index  =  mid  +  1 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											else 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												end  =  mid ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* If we're off the end of the array, we're done */ 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( index  > =  l - > length ) 
							 
						 
					
						
							
								
									
										
										
										
											2008-10-18 20:28:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  NULL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* Update the abstract position to be the actual pid that we found */ 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									iter  =  l - > list  +  index ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									* pos  =  cgroup_pid_fry ( cgrp ,  * iter ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-10-18 20:28:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  iter ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  cgroup_pidlist_stop ( struct  seq_file  * s ,  void  * v )  
						 
					
						
							
								
									
										
										
										
											2008-10-18 20:28:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  kernfs_open_file  * of  =  s - > private ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_pidlist  * l  =  of - > priv ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( l ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										mod_delayed_work ( cgroup_pidlist_destroy_wq ,  & l - > destroy_dwork , 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:59 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												 CGROUP_PIDLIST_DESTROY_DELAY ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_unlock ( & seq_css ( s ) - > cgroup - > pidlist_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-10-18 20:28:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  * cgroup_pidlist_next ( struct  seq_file  * s ,  void  * v ,  loff_t  * pos )  
						 
					
						
							
								
									
										
										
										
											2008-10-18 20:28:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  kernfs_open_file  * of  =  s - > private ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_pidlist  * l  =  of - > priv ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									pid_t  * p  =  v ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									pid_t  * end  =  l - > list  +  l - > length ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-10-18 20:28:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Advance  to  the  next  pid  in  the  array .  If  this  goes  off  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  end ,  we ' re  done 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									p + + ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( p  > =  end )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  NULL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									}  else  { 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										* pos  =  cgroup_pid_fry ( seq_css ( s ) - > cgroup ,  * p ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-10-18 20:28:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  p ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_pidlist_show ( struct  seq_file  * s ,  void  * v )  
						 
					
						
							
								
									
										
										
										
											2008-10-18 20:28:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  seq_printf ( s ,  " %d \n " ,  * ( int  * ) v ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:32 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:24 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  u64  cgroup_read_notify_on_release ( struct  cgroup_subsys_state  * css ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
													 struct  cftype  * cft ) 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:24 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  notify_on_release ( css - > cgroup ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:24 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_write_notify_on_release ( struct  cgroup_subsys_state  * css ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
													  struct  cftype  * cft ,  u64  val ) 
							 
						 
					
						
							
								
									
										
										
										
											2008-07-25 01:47:01 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:24 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									clear_bit ( CGRP_RELEASABLE ,  & css - > cgroup - > flags ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-07-25 01:47:01 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( val ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:24 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										set_bit ( CGRP_NOTIFY_ON_RELEASE ,  & css - > cgroup - > flags ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-07-25 01:47:01 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									else 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:24 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										clear_bit ( CGRP_NOTIFY_ON_RELEASE ,  & css - > cgroup - > flags ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-07-25 01:47:01 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:24 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  u64  cgroup_clone_children_read ( struct  cgroup_subsys_state  * css ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												      struct  cftype  * cft ) 
							 
						 
					
						
							
								
									
										
										
										
											2010-10-27 15:33:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:24 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  test_bit ( CGRP_CPUSET_CLONE_CHILDREN ,  & css - > cgroup - > flags ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2010-10-27 15:33:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:24 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_clone_children_write ( struct  cgroup_subsys_state  * css ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												       struct  cftype  * cft ,  u64  val ) 
							 
						 
					
						
							
								
									
										
										
										
											2010-10-27 15:33:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( val ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:24 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										set_bit ( CGRP_CPUSET_CLONE_CHILDREN ,  & css - > cgroup - > flags ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2010-10-27 15:33:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									else 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:24 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										clear_bit ( CGRP_CPUSET_CLONE_CHILDREN ,  & css - > cgroup - > flags ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2010-10-27 15:33:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-03 19:14:34 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  struct  cftype  cgroup_base_files [ ]  =  {  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									{ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-03 19:14:34 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										. name  =  " cgroup.procs " , 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										. seq_start  =  cgroup_pidlist_start , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. seq_next  =  cgroup_pidlist_next , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. seq_stop  =  cgroup_pidlist_stop , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. seq_show  =  cgroup_pidlist_show , 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										. private  =  CGROUP_FILE_PROCS , 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										. write  =  cgroup_procs_write , 
							 
						 
					
						
							
								
									
										
										
										
											2011-05-26 16:25:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										. mode  =  S_IRUGO  |  S_IWUSR , 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} , 
							 
						 
					
						
							
								
									
										
										
										
											2010-10-27 15:33:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									{ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. name  =  " cgroup.clone_children " , 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce sane_behavior mount option
It's a sad fact that at this point various cgroup controllers are
carrying so many idiosyncrasies and pure insanities that it simply
isn't possible to reach any sort of sane consistent behavior while
maintaining staying fully compatible with what already has been
exposed to userland.
As we can't break exposed userland interface, transitioning to sane
behaviors can only be done in steps while maintaining backwards
compatibility.  This patch introduces a new mount option -
__DEVEL__sane_behavior - which disables crazy features and enforces
consistent behaviors in cgroup core proper and various controllers.
As exactly which behaviors it changes are still being determined, the
mount option, at this point, is useful only for development of the new
behaviors.  As such, the mount option is prefixed with __DEVEL__ and
generates a warning message when used.
Eventually, once we get to the point where all controller's behaviors
are consistent enough to implement unified hierarchy, the __DEVEL__
prefix will be dropped, and more importantly, unified-hierarchy will
enforce sane_behavior by default.  Maybe we'll able to completely drop
the crazy stuff after a while, maybe not, but we at least have a
strategy to move on to saner behaviors.
This patch introduces the mount option and changes the following
behaviors in cgroup core.
* Mount options "noprefix" and "clone_children" are disallowed.  Also,
  cgroupfs file cgroup.clone_children is not created.
* When mounting an existing superblock, mount options should match.
  This is currently pretty crazy.  If one mounts a cgroup, creates a
  subdirectory, unmounts it and then mount it again with different
  option, it looks like the new options are applied but they aren't.
* Remount is disallowed.
The behaviors changes are documented in the comment above
CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
controllers are converted and planned improvements progress.
v2: Dropped unnecessary explicit file permission setting sane_behavior
    cftype entry as suggested by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
											 
										 
										
											2013-04-14 20:15:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										. flags  =  CFTYPE_INSANE , 
							 
						 
					
						
							
								
									
										
										
										
											2010-10-27 15:33:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										. read_u64  =  cgroup_clone_children_read , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. write_u64  =  cgroup_clone_children_write , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} , 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce sane_behavior mount option
It's a sad fact that at this point various cgroup controllers are
carrying so many idiosyncrasies and pure insanities that it simply
isn't possible to reach any sort of sane consistent behavior while
maintaining staying fully compatible with what already has been
exposed to userland.
As we can't break exposed userland interface, transitioning to sane
behaviors can only be done in steps while maintaining backwards
compatibility.  This patch introduces a new mount option -
__DEVEL__sane_behavior - which disables crazy features and enforces
consistent behaviors in cgroup core proper and various controllers.
As exactly which behaviors it changes are still being determined, the
mount option, at this point, is useful only for development of the new
behaviors.  As such, the mount option is prefixed with __DEVEL__ and
generates a warning message when used.
Eventually, once we get to the point where all controller's behaviors
are consistent enough to implement unified hierarchy, the __DEVEL__
prefix will be dropped, and more importantly, unified-hierarchy will
enforce sane_behavior by default.  Maybe we'll able to completely drop
the crazy stuff after a while, maybe not, but we at least have a
strategy to move on to saner behaviors.
This patch introduces the mount option and changes the following
behaviors in cgroup core.
* Mount options "noprefix" and "clone_children" are disallowed.  Also,
  cgroupfs file cgroup.clone_children is not created.
* When mounting an existing superblock, mount options should match.
  This is currently pretty crazy.  If one mounts a cgroup, creates a
  subdirectory, unmounts it and then mount it again with different
  option, it looks like the new options are applied but they aren't.
* Remount is disallowed.
The behaviors changes are documented in the comment above
CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
controllers are converted and planned improvements progress.
v2: Dropped unnecessary explicit file permission setting sane_behavior
    cftype entry as suggested by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
											 
										 
										
											2013-04-14 20:15:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									{ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. name  =  " cgroup.sane_behavior " , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. flags  =  CFTYPE_ONLY_ON_ROOT , 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										. seq_show  =  cgroup_sane_behavior_show , 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: introduce sane_behavior mount option
It's a sad fact that at this point various cgroup controllers are
carrying so many idiosyncrasies and pure insanities that it simply
isn't possible to reach any sort of sane consistent behavior while
maintaining staying fully compatible with what already has been
exposed to userland.
As we can't break exposed userland interface, transitioning to sane
behaviors can only be done in steps while maintaining backwards
compatibility.  This patch introduces a new mount option -
__DEVEL__sane_behavior - which disables crazy features and enforces
consistent behaviors in cgroup core proper and various controllers.
As exactly which behaviors it changes are still being determined, the
mount option, at this point, is useful only for development of the new
behaviors.  As such, the mount option is prefixed with __DEVEL__ and
generates a warning message when used.
Eventually, once we get to the point where all controller's behaviors
are consistent enough to implement unified hierarchy, the __DEVEL__
prefix will be dropped, and more importantly, unified-hierarchy will
enforce sane_behavior by default.  Maybe we'll able to completely drop
the crazy stuff after a while, maybe not, but we at least have a
strategy to move on to saner behaviors.
This patch introduces the mount option and changes the following
behaviors in cgroup core.
* Mount options "noprefix" and "clone_children" are disallowed.  Also,
  cgroupfs file cgroup.clone_children is not created.
* When mounting an existing superblock, mount options should match.
  This is currently pretty crazy.  If one mounts a cgroup, creates a
  subdirectory, unmounts it and then mount it again with different
  option, it looks like the new options are applied but they aren't.
* Remount is disallowed.
The behaviors changes are documented in the comment above
CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
controllers are converted and planned improvements progress.
v2: Dropped unnecessary explicit file permission setting sane_behavior
    cftype entry as suggested by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
											 
										 
										
											2013-04-14 20:15:26 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} , 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									{ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. name  =  " cgroup.controllers " , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. flags  =  CFTYPE_ONLY_ON_DFL  |  CFTYPE_ONLY_ON_ROOT , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. seq_show  =  cgroup_root_controllers_show , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									{ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. name  =  " cgroup.controllers " , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. flags  =  CFTYPE_ONLY_ON_DFL  |  CFTYPE_NOT_ON_ROOT , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. seq_show  =  cgroup_controllers_show , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									{ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. name  =  " cgroup.subtree_control " , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. flags  =  CFTYPE_ONLY_ON_DFL , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. seq_show  =  cgroup_subtree_control_show , 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:21 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										. write  =  cgroup_subtree_control_write , 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} , 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-25 18:28:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									{ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. name  =  " cgroup.populated " , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. flags  =  CFTYPE_ONLY_ON_DFL  |  CFTYPE_NOT_ON_ROOT , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. seq_show  =  cgroup_populated_show , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} , 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-03 19:14:34 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Historical  crazy  stuff .   These  don ' t  have  " cgroup. "   prefix  and 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  don ' t  exist  if  sane_behavior .   If  you ' re  depending  on  these ,  be 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  prepared  to  be  burned . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									{ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. name  =  " tasks " , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. flags  =  CFTYPE_INSANE , 		/* use "procs" instead */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										. seq_start  =  cgroup_pidlist_start , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. seq_next  =  cgroup_pidlist_next , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. seq_stop  =  cgroup_pidlist_stop , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. seq_show  =  cgroup_pidlist_show , 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										. private  =  CGROUP_FILE_TASKS , 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										. write  =  cgroup_tasks_write , 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-03 19:14:34 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										. mode  =  S_IRUGO  |  S_IWUSR , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									{ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. name  =  " notify_on_release " , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. flags  =  CFTYPE_INSANE , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. read_u64  =  cgroup_read_notify_on_release , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. write_u64  =  cgroup_write_notify_on_release , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} , 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									{ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. name  =  " release_agent " , 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-03 19:13:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										. flags  =  CFTYPE_INSANE  |  CFTYPE_ONLY_ON_ROOT , 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										. seq_show  =  cgroup_release_agent_show , 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:16:21 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										. write  =  cgroup_release_agent_write , 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										. max_write_len  =  PATH_MAX  -  1 , 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} , 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									{  } 	/* terminate */ 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:32 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								} ;  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2012-08-23 16:53:29 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
									
										
										
										
											2013-06-28 16:24:11 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  cgroup_populate_dir  -  create  subsys  files  in  a  cgroup  directory 
							 
						 
					
						
							
								
									
										
										
										
											2012-08-23 16:53:29 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ cgrp :  target  cgroup 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ subsys_mask :  mask  of  the  subsystem  ids  whose  files  should  be  added 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-28 16:24:11 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  On  failure ,  no  file  is  added . 
							 
						 
					
						
							
								
									
										
										
										
											2012-08-23 16:53:29 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_populate_dir ( struct  cgroup  * cgrp ,  unsigned  int  subsys_mask )  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgroup_subsys  * ss ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-07-12 12:34:02 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  i ,  ret  =  0 ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:32 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* process cftsets of each subsystem */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-07-12 12:34:02 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for_each_subsys ( ss ,  i )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										struct  cftype  * cfts ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-07-12 12:34:02 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ! ( subsys_mask  &  ( 1  < <  i ) ) ) 
							 
						 
					
						
							
								
									
										
										
										
											2012-08-23 16:53:29 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										list_for_each_entry ( cfts ,  & ss - > cfts ,  node )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											ret  =  cgroup_addrm_files ( cgrp ,  cfts ,  true ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-28 16:24:11 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											if  ( ret  <  0 ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												goto  err ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-28 16:24:11 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								err :  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									cgroup_clear_dir ( cgrp ,  subsys_mask ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  ret ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: RCU protect each cgroup_subsys_state release
With the planned unified hierarchy, individual css's will be created
and destroyed dynamically across the lifetime of a cgroup.  To enable
such usages, css destruction is being decoupled from cgroup
destruction.  Most of the destruction path has been decoupled but the
actual free of css still depends on cgroup free path.
When all css refs are drained, css_release() kicks off
css_free_work_fn() which puts the cgroup.  When the cgroup refcnt
reaches zero, cgroup_diput() is invoked which in turn schedules RCU
free of the cgroup.  After a grace period, all css's are freed along
with the cgroup itself.
This patch moves the RCU grace period and css freeing from cgroup
release path to css release path.  css_release(), instead of kicking
off css_free_work_fn() directly, schedules RCU callback
css_free_rcu_fn() which in turn kicks off css_free_work_fn() after a
RCU grace period.  css_free_work_fn() is updated to free the css
directly.
The five-way punting - percpu ref kill confirmation, a work item,
percpu ref release, RCU grace period, and again a work item - is quite
hairy but the work items are there only to provide process context and
the actual sequence is kill confirm -> release -> RCU free, which
isn't simple but not too crazy.
This removes cgroup_css() usage after offline_css() allowing clearing
cgroup->subsys[] from offline_css(), which makes it consistent with
online_css() and brings it closer to proper lifetime management for
individual css's.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2013-08-13 20:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  css  destruction  is  four - stage  process . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  1.  Destruction  starts .   Killing  of  the  percpu_ref  is  initiated . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *     Implemented  in  kill_css ( ) . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  2.  When  the  percpu_ref  is  confirmed  to  be  visible  as  killed  on  all  CPUs 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:11:01 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *     and  thus  css_tryget_online ( )  is  guaranteed  to  fail ,  the  css  can  be 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *     offlined  by  invoking  offline_css ( ) .   After  offlining ,  the  base  ref  is 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *     put .   Implemented  in  css_killed_work_fn ( ) . 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: RCU protect each cgroup_subsys_state release
With the planned unified hierarchy, individual css's will be created
and destroyed dynamically across the lifetime of a cgroup.  To enable
such usages, css destruction is being decoupled from cgroup
destruction.  Most of the destruction path has been decoupled but the
actual free of css still depends on cgroup free path.
When all css refs are drained, css_release() kicks off
css_free_work_fn() which puts the cgroup.  When the cgroup refcnt
reaches zero, cgroup_diput() is invoked which in turn schedules RCU
free of the cgroup.  After a grace period, all css's are freed along
with the cgroup itself.
This patch moves the RCU grace period and css freeing from cgroup
release path to css release path.  css_release(), instead of kicking
off css_free_work_fn() directly, schedules RCU callback
css_free_rcu_fn() which in turn kicks off css_free_work_fn() after a
RCU grace period.  css_free_work_fn() is updated to free the css
directly.
The five-way punting - percpu ref kill confirmation, a work item,
percpu ref release, RCU grace period, and again a work item - is quite
hairy but the work items are there only to provide process context and
the actual sequence is kill confirm -> release -> RCU free, which
isn't simple but not too crazy.
This removes cgroup_css() usage after offline_css() allowing clearing
cgroup->subsys[] from offline_css(), which makes it consistent with
online_css() and brings it closer to proper lifetime management for
individual css's.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2013-08-13 20:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  3.  When  the  percpu_ref  reaches  zero ,  the  only  possible  remaining 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *     accessors  are  inside  RCU  read  sections .   css_release ( )  schedules  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *     RCU  callback . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  4.  After  the  grace  period ,  the  css  can  be  freed .   Implemented  in 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *     css_free_work_fn ( ) . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  It  is  actually  hairier  because  both  step  2  and  4  require  process  context 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  and  thus  involve  punting  to  css - > destroy_work  adding  two  additional 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  steps  to  the  already  complex  sequence . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 11:01:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  css_free_work_fn ( struct  work_struct  * work )  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css->refcnt clearing on cgroup removal optional
Currently, cgroup removal tries to drain all css references.  If there
are active css references, the removal logic waits and retries
->pre_detroy() until either all refs drop to zero or removal is
cancelled.
This semantics is unusual and adds non-trivial complexity to cgroup
core and IMHO is fundamentally misguided in that it couples internal
implementation details (references to internal data structure) with
externally visible operation (rmdir).  To userland, this is a behavior
peculiarity which is unnecessary and difficult to expect (css refs is
otherwise invisible from userland), and, to policy implementations,
this is an unnecessary restriction (e.g. blkcg wants to hold css refs
for caching purposes but can't as that becomes visible as rmdir hang).
Unfortunately, memcg currently depends on ->pre_destroy() retrials and
cgroup removal vetoing and can't be immmediately switched to the new
behavior.  This patch introduces the new behavior of not waiting for
css refs to drain and maintains the old behavior for subsystems which
have __DEPRECATED_clear_css_refs set.
Once, memcg is updated, we can drop the code paths for the old
behavior as proposed in the following patch.  Note that the following
patch is incorrect in that dput work item is in cgroup and may lose
some of dputs when multiples css's are released back-to-back, and
__css_put() triggers check_for_release() when refcnt reaches 0 instead
of 1; however, it shows what part can be removed.
  http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251
Note that, in not-too-distant future, cgroup core will start emitting
warning messages for subsys which require the old behavior, so please
get moving.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
											 
										 
										
											2012-04-01 12:09:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * css  = 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 11:01:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										container_of ( work ,  struct  cgroup_subsys_state ,  destroy_work ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: RCU protect each cgroup_subsys_state release
With the planned unified hierarchy, individual css's will be created
and destroyed dynamically across the lifetime of a cgroup.  To enable
such usages, css destruction is being decoupled from cgroup
destruction.  Most of the destruction path has been decoupled but the
actual free of css still depends on cgroup free path.
When all css refs are drained, css_release() kicks off
css_free_work_fn() which puts the cgroup.  When the cgroup refcnt
reaches zero, cgroup_diput() is invoked which in turn schedules RCU
free of the cgroup.  After a grace period, all css's are freed along
with the cgroup itself.
This patch moves the RCU grace period and css freeing from cgroup
release path to css release path.  css_release(), instead of kicking
off css_free_work_fn() directly, schedules RCU callback
css_free_rcu_fn() which in turn kicks off css_free_work_fn() after a
RCU grace period.  css_free_work_fn() is updated to free the css
directly.
The five-way punting - percpu ref kill confirmation, a work item,
percpu ref release, RCU grace period, and again a work item - is quite
hairy but the work items are there only to provide process context and
the actual sequence is kill confirm -> release -> RCU free, which
isn't simple but not too crazy.
This removes cgroup_css() usage after offline_css() allowing clearing
cgroup->subsys[] from offline_css(), which makes it consistent with
online_css() and brings it closer to proper lifetime management for
individual css's.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2013-08-13 20:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * cgrp  =  css - > cgroup ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css->refcnt clearing on cgroup removal optional
Currently, cgroup removal tries to drain all css references.  If there
are active css references, the removal logic waits and retries
->pre_detroy() until either all refs drop to zero or removal is
cancelled.
This semantics is unusual and adds non-trivial complexity to cgroup
core and IMHO is fundamentally misguided in that it couples internal
implementation details (references to internal data structure) with
externally visible operation (rmdir).  To userland, this is a behavior
peculiarity which is unnecessary and difficult to expect (css refs is
otherwise invisible from userland), and, to policy implementations,
this is an unnecessary restriction (e.g. blkcg wants to hold css refs
for caching purposes but can't as that becomes visible as rmdir hang).
Unfortunately, memcg currently depends on ->pre_destroy() retrials and
cgroup removal vetoing and can't be immmediately switched to the new
behavior.  This patch introduces the new behavior of not waiting for
css refs to drain and maintains the old behavior for subsystems which
have __DEPRECATED_clear_css_refs set.
Once, memcg is updated, we can drop the code paths for the old
behavior as proposed in the following patch.  Note that the following
patch is incorrect in that dput work item is in cgroup and may lose
some of dputs when multiples css's are released back-to-back, and
__css_put() triggers check_for_release() when refcnt reaches 0 instead
of 1; however, it shows what part can be removed.
  http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251
Note that, in not-too-distant future, cgroup core will start emitting
warning messages for subsys which require the old behavior, so please
get moving.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
											 
										 
										
											2012-04-01 12:09:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( css - > ss )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										/* css free path */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( css - > parent ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											css_put ( css - > parent ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 11:01:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										css - > ss - > css_free ( css ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgroup_put ( cgrp ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									}  else  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										/* cgroup free path */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										atomic_dec ( & cgrp - > root - > nr_cgrps ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgroup_pidlist_destroy_all ( cgrp ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( cgroup_parent ( cgrp ) )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 *  We  get  a  ref  to  the  parent ,  and  put  the  ref  when 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 *  this  cgroup  is  being  freed ,  so  it ' s  guaranteed 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 *  that  the  parent  won ' t  be  destroyed  before  its 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 *  children . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											cgroup_put ( cgroup_parent ( cgrp ) ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											kernfs_put ( cgrp - > kn ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											kfree ( cgrp ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										}  else  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 *  This  is  root  cgroup ' s  refcnt  reaching  zero , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 *  which  indicates  that  the  root  should  be 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 *  released . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											cgroup_destroy_root ( cgrp - > root ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css->refcnt clearing on cgroup removal optional
Currently, cgroup removal tries to drain all css references.  If there
are active css references, the removal logic waits and retries
->pre_detroy() until either all refs drop to zero or removal is
cancelled.
This semantics is unusual and adds non-trivial complexity to cgroup
core and IMHO is fundamentally misguided in that it couples internal
implementation details (references to internal data structure) with
externally visible operation (rmdir).  To userland, this is a behavior
peculiarity which is unnecessary and difficult to expect (css refs is
otherwise invisible from userland), and, to policy implementations,
this is an unnecessary restriction (e.g. blkcg wants to hold css refs
for caching purposes but can't as that becomes visible as rmdir hang).
Unfortunately, memcg currently depends on ->pre_destroy() retrials and
cgroup removal vetoing and can't be immmediately switched to the new
behavior.  This patch introduces the new behavior of not waiting for
css refs to drain and maintains the old behavior for subsystems which
have __DEPRECATED_clear_css_refs set.
Once, memcg is updated, we can drop the code paths for the old
behavior as proposed in the following patch.  Note that the following
patch is incorrect in that dput work item is in cgroup and may lose
some of dputs when multiples css's are released back-to-back, and
__css_put() triggers check_for_release() when refcnt reaches 0 instead
of 1; however, it shows what part can be removed.
  http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251
Note that, in not-too-distant future, cgroup core will start emitting
warning messages for subsys which require the old behavior, so please
get moving.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
											 
										 
										
											2012-04-01 12:09:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: RCU protect each cgroup_subsys_state release
With the planned unified hierarchy, individual css's will be created
and destroyed dynamically across the lifetime of a cgroup.  To enable
such usages, css destruction is being decoupled from cgroup
destruction.  Most of the destruction path has been decoupled but the
actual free of css still depends on cgroup free path.
When all css refs are drained, css_release() kicks off
css_free_work_fn() which puts the cgroup.  When the cgroup refcnt
reaches zero, cgroup_diput() is invoked which in turn schedules RCU
free of the cgroup.  After a grace period, all css's are freed along
with the cgroup itself.
This patch moves the RCU grace period and css freeing from cgroup
release path to css release path.  css_release(), instead of kicking
off css_free_work_fn() directly, schedules RCU callback
css_free_rcu_fn() which in turn kicks off css_free_work_fn() after a
RCU grace period.  css_free_work_fn() is updated to free the css
directly.
The five-way punting - percpu ref kill confirmation, a work item,
percpu ref release, RCU grace period, and again a work item - is quite
hairy but the work items are there only to provide process context and
the actual sequence is kill confirm -> release -> RCU free, which
isn't simple but not too crazy.
This removes cgroup_css() usage after offline_css() allowing clearing
cgroup->subsys[] from offline_css(), which makes it consistent with
online_css() and brings it closer to proper lifetime management for
individual css's.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2013-08-13 20:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  css_free_rcu_fn ( struct  rcu_head  * rcu_head )  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller.  As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability.  For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable.  In its
usage, css refcnting isn't very different from module refcnting.
This patch converts css refcnting to use the recently added
percpu_ref.  css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.
The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s.  This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs.  The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.
This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id().  While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it.  Just drop the sanity check and use
rcu_dereference_raw() instead.
v2: - init_cgroup_css() was calling percpu_ref_init() without checking
      the return value.  This causes two problems - the obvious lack
      of error handling and percpu_ref_init() being called from
      cgroup_init_subsys() before the allocators are up, which
      triggers warnings but doesn't cause actual problems as the
      refcnt isn't used for roots anyway.  Fix both by moving
      percpu_ref_init() to cgroup_create().
    - The base references were put too early by
      percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
      refs one extra time.  This wasn't noticeable because css's go
      through another RCU grace period before being freed.  Update
      cgroup_destroy_locked() to grab an extra reference before
      killing the refcnts.  This problem was noticed by Kent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
											 
										 
										
											2013-06-13 19:39:16 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * css  = 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: RCU protect each cgroup_subsys_state release
With the planned unified hierarchy, individual css's will be created
and destroyed dynamically across the lifetime of a cgroup.  To enable
such usages, css destruction is being decoupled from cgroup
destruction.  Most of the destruction path has been decoupled but the
actual free of css still depends on cgroup free path.
When all css refs are drained, css_release() kicks off
css_free_work_fn() which puts the cgroup.  When the cgroup refcnt
reaches zero, cgroup_diput() is invoked which in turn schedules RCU
free of the cgroup.  After a grace period, all css's are freed along
with the cgroup itself.
This patch moves the RCU grace period and css freeing from cgroup
release path to css release path.  css_release(), instead of kicking
off css_free_work_fn() directly, schedules RCU callback
css_free_rcu_fn() which in turn kicks off css_free_work_fn() after a
RCU grace period.  css_free_work_fn() is updated to free the css
directly.
The five-way punting - percpu ref kill confirmation, a work item,
percpu ref release, RCU grace period, and again a work item - is quite
hairy but the work items are there only to provide process context and
the actual sequence is kill confirm -> release -> RCU free, which
isn't simple but not too crazy.
This removes cgroup_css() usage after offline_css() allowing clearing
cgroup->subsys[] from offline_css(), which makes it consistent with
online_css() and brings it closer to proper lifetime management for
individual css's.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2013-08-13 20:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										container_of ( rcu_head ,  struct  cgroup_subsys_state ,  rcu_head ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller.  As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability.  For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable.  In its
usage, css refcnting isn't very different from module refcnting.
This patch converts css refcnting to use the recently added
percpu_ref.  css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.
The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s.  This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs.  The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.
This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id().  While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it.  Just drop the sanity check and use
rcu_dereference_raw() instead.
v2: - init_cgroup_css() was calling percpu_ref_init() without checking
      the return value.  This causes two problems - the obvious lack
      of error handling and percpu_ref_init() being called from
      cgroup_init_subsys() before the allocators are up, which
      triggers warnings but doesn't cause actual problems as the
      refcnt isn't used for roots anyway.  Fix both by moving
      percpu_ref_init() to cgroup_create().
    - The base references were put too early by
      percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
      refs one extra time.  This wasn't noticeable because css's go
      through another RCU grace period before being freed.  Update
      cgroup_destroy_locked() to grab an extra reference before
      killing the refcnts.  This problem was noticed by Kent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
											 
										 
										
											2013-06-13 19:39:16 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 11:01:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									INIT_WORK ( & css - > destroy_work ,  css_free_work_fn ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use a dedicated workqueue for cgroup destruction
Since be44562613851 ("cgroup: remove synchronize_rcu() from
cgroup_diput()"), cgroup destruction path makes use of workqueue.  css
freeing is performed from a work item from that point on and a later
commit, ea15f8ccdb430 ("cgroup: split cgroup destruction into two
steps"), moves css offlining to workqueue too.
As cgroup destruction isn't depended upon for memory reclaim, the
destruction work items were put on the system_wq; unfortunately, some
controller may block in the destruction path for considerable duration
while holding cgroup_mutex.  As large part of destruction path is
synchronized through cgroup_mutex, when combined with high rate of
cgroup removals, this has potential to fill up system_wq's max_active
of 256.
Also, it turns out that memcg's css destruction path ends up queueing
and waiting for work items on system_wq through work_on_cpu().  If
such operation happens while system_wq is fully occupied by cgroup
destruction work items, work_on_cpu() can't make forward progress
because system_wq is full and other destruction work items on
system_wq can't make forward progress because the work item waiting
for work_on_cpu() is holding cgroup_mutex, leading to deadlock.
This can be fixed by queueing destruction work items on a separate
workqueue.  This patch creates a dedicated workqueue -
cgroup_destroy_wq - for this purpose.  As these work items shouldn't
have inter-dependencies and mostly serialized by cgroup_mutex anyway,
giving high concurrency level doesn't buy anything and the workqueue's
@max_active is set to 1 so that destruction work items are executed
one by one on each CPU.
Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
cgroup_destroy_wq can't be allocated from cgroup_init().  Do it from a
separate core_initcall().  In the future, we probably want to reorder
so that workqueue init happens before cgroup_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Shawn Bohrer <shawn.bohrer@gmail.com>
Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
Cc: stable@vger.kernel.org # v3.9+
											 
										 
										
											2013-11-22 17:14:39 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									queue_work ( cgroup_destroy_wq ,  & css - > destroy_work ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css->refcnt clearing on cgroup removal optional
Currently, cgroup removal tries to drain all css references.  If there
are active css references, the removal logic waits and retries
->pre_detroy() until either all refs drop to zero or removal is
cancelled.
This semantics is unusual and adds non-trivial complexity to cgroup
core and IMHO is fundamentally misguided in that it couples internal
implementation details (references to internal data structure) with
externally visible operation (rmdir).  To userland, this is a behavior
peculiarity which is unnecessary and difficult to expect (css refs is
otherwise invisible from userland), and, to policy implementations,
this is an unnecessary restriction (e.g. blkcg wants to hold css refs
for caching purposes but can't as that becomes visible as rmdir hang).
Unfortunately, memcg currently depends on ->pre_destroy() retrials and
cgroup removal vetoing and can't be immmediately switched to the new
behavior.  This patch introduces the new behavior of not waiting for
css refs to drain and maintains the old behavior for subsystems which
have __DEPRECATED_clear_css_refs set.
Once, memcg is updated, we can drop the code paths for the old
behavior as proposed in the following patch.  Note that the following
patch is incorrect in that dput work item is in cgroup and may lose
some of dputs when multiples css's are released back-to-back, and
__css_put() triggers check_for_release() when refcnt reaches 0 instead
of 1; however, it shows what part can be removed.
  http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251
Note that, in not-too-distant future, cgroup core will start emitting
warning messages for subsys which require the old behavior, so please
get moving.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
											 
										 
										
											2012-04-01 12:09:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  css_release_work_fn ( struct  work_struct  * work )  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller.  As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability.  For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable.  In its
usage, css refcnting isn't very different from module refcnting.
This patch converts css refcnting to use the recently added
percpu_ref.  css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.
The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s.  This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs.  The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.
This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id().  While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it.  Just drop the sanity check and use
rcu_dereference_raw() instead.
v2: - init_cgroup_css() was calling percpu_ref_init() without checking
      the return value.  This causes two problems - the obvious lack
      of error handling and percpu_ref_init() being called from
      cgroup_init_subsys() before the allocators are up, which
      triggers warnings but doesn't cause actual problems as the
      refcnt isn't used for roots anyway.  Fix both by moving
      percpu_ref_init() to cgroup_create().
    - The base references were put too early by
      percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
      refs one extra time.  This wasn't noticeable because css's go
      through another RCU grace period before being freed.  Update
      cgroup_destroy_locked() to grab an extra reference before
      killing the refcnts.  This problem was noticed by Kent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
											 
										 
										
											2013-06-13 19:39:16 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * css  = 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										container_of ( work ,  struct  cgroup_subsys_state ,  destroy_work ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys  * ss  =  css - > ss ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * cgrp  =  css - > cgroup ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:49 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_lock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:49 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									css - > flags  | =  CSS_RELEASED ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:49 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									list_del_rcu ( & css - > sibling ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ss )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										/* css release path */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgroup_idr_remove ( & ss - > css_idr ,  css - > id ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									}  else  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										/* cgroup release path */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgroup_idr_remove ( & cgrp - > root - > cgroup_idr ,  cgrp - > id ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgrp - > id  =  - 1 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller.  As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability.  For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable.  In its
usage, css refcnting isn't very different from module refcnting.
This patch converts css refcnting to use the recently added
percpu_ref.  css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.
The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s.  This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs.  The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.
This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id().  While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it.  Just drop the sanity check and use
rcu_dereference_raw() instead.
v2: - init_cgroup_css() was calling percpu_ref_init() without checking
      the return value.  This causes two problems - the obvious lack
      of error handling and percpu_ref_init() being called from
      cgroup_init_subsys() before the allocators are up, which
      triggers warnings but doesn't cause actual problems as the
      refcnt isn't used for roots anyway.  Fix both by moving
      percpu_ref_init() to cgroup_create().
    - The base references were put too early by
      percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
      refs one extra time.  This wasn't noticeable because css's go
      through another RCU grace period before being freed.  Update
      cgroup_destroy_locked() to grab an extra reference before
      killing the refcnts.  This problem was noticed by Kent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
											 
										 
										
											2013-06-13 19:39:16 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:49 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_unlock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: RCU protect each cgroup_subsys_state release
With the planned unified hierarchy, individual css's will be created
and destroyed dynamically across the lifetime of a cgroup.  To enable
such usages, css destruction is being decoupled from cgroup
destruction.  Most of the destruction path has been decoupled but the
actual free of css still depends on cgroup free path.
When all css refs are drained, css_release() kicks off
css_free_work_fn() which puts the cgroup.  When the cgroup refcnt
reaches zero, cgroup_diput() is invoked which in turn schedules RCU
free of the cgroup.  After a grace period, all css's are freed along
with the cgroup itself.
This patch moves the RCU grace period and css freeing from cgroup
release path to css release path.  css_release(), instead of kicking
off css_free_work_fn() directly, schedules RCU callback
css_free_rcu_fn() which in turn kicks off css_free_work_fn() after a
RCU grace period.  css_free_work_fn() is updated to free the css
directly.
The five-way punting - percpu ref kill confirmation, a work item,
percpu ref release, RCU grace period, and again a work item - is quite
hairy but the work items are there only to provide process context and
the actual sequence is kill confirm -> release -> RCU free, which
isn't simple but not too crazy.
This removes cgroup_css() usage after offline_css() allowing clearing
cgroup->subsys[] from offline_css(), which makes it consistent with
online_css() and brings it closer to proper lifetime management for
individual css's.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2013-08-13 20:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									call_rcu ( & css - > rcu_head ,  css_free_rcu_fn ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller.  As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability.  For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable.  In its
usage, css refcnting isn't very different from module refcnting.
This patch converts css refcnting to use the recently added
percpu_ref.  css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.
The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s.  This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs.  The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.
This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id().  While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it.  Just drop the sanity check and use
rcu_dereference_raw() instead.
v2: - init_cgroup_css() was calling percpu_ref_init() without checking
      the return value.  This causes two problems - the obvious lack
      of error handling and percpu_ref_init() being called from
      cgroup_init_subsys() before the allocators are up, which
      triggers warnings but doesn't cause actual problems as the
      refcnt isn't used for roots anyway.  Fix both by moving
      percpu_ref_init() to cgroup_create().
    - The base references were put too early by
      percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
      refs one extra time.  This wasn't noticeable because css's go
      through another RCU grace period before being freed.  Update
      cgroup_destroy_locked() to grab an extra reference before
      killing the refcnts.  This problem was noticed by Kent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
											 
										 
										
											2013-06-13 19:39:16 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  void  css_release ( struct  percpu_ref  * ref )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * css  = 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										container_of ( ref ,  struct  cgroup_subsys_state ,  refcnt ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									INIT_WORK ( & css - > destroy_work ,  css_release_work_fn ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									queue_work ( cgroup_destroy_wq ,  & css - > destroy_work ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller.  As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability.  For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable.  In its
usage, css refcnting isn't very different from module refcnting.
This patch converts css refcnting to use the recently added
percpu_ref.  css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.
The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s.  This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs.  The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.
This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id().  While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it.  Just drop the sanity check and use
rcu_dereference_raw() instead.
v2: - init_cgroup_css() was calling percpu_ref_init() without checking
      the return value.  This causes two problems - the obvious lack
      of error handling and percpu_ref_init() being called from
      cgroup_init_subsys() before the allocators are up, which
      triggers warnings but doesn't cause actual problems as the
      refcnt isn't used for roots anyway.  Fix both by moving
      percpu_ref_init() to cgroup_create().
    - The base references were put too early by
      percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
      refs one extra time.  This wasn't noticeable because css's go
      through another RCU grace period before being freed.  Update
      cgroup_destroy_locked() to grab an extra reference before
      killing the refcnts.  This problem was noticed by Kent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
											 
										 
										
											2013-06-13 19:39:16 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  init_and_link_css ( struct  cgroup_subsys_state  * css ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											      struct  cgroup_subsys  * ss ,  struct  cgroup  * cgrp ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:49 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgroup_get ( cgrp ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									memset ( css ,  0 ,  sizeof ( * css ) ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:40:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									css - > cgroup  =  cgrp ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									css - > ss  =  ss ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									INIT_LIST_HEAD ( & css - > sibling ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									INIT_LIST_HEAD ( & css - > children ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:49 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									css - > serial_nr  =  css_serial_nr_next + + ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 11:01:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( cgroup_parent ( cgrp ) )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										css - > parent  =  cgroup_css ( cgroup_parent ( cgrp ) ,  ss ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										css_get ( css - > parent ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css->refcnt clearing on cgroup removal optional
Currently, cgroup removal tries to drain all css references.  If there
are active css references, the removal logic waits and retries
->pre_detroy() until either all refs drop to zero or removal is
cancelled.
This semantics is unusual and adds non-trivial complexity to cgroup
core and IMHO is fundamentally misguided in that it couples internal
implementation details (references to internal data structure) with
externally visible operation (rmdir).  To userland, this is a behavior
peculiarity which is unnecessary and difficult to expect (css refs is
otherwise invisible from userland), and, to policy implementations,
this is an unnecessary restriction (e.g. blkcg wants to hold css refs
for caching purposes but can't as that becomes visible as rmdir hang).
Unfortunately, memcg currently depends on ->pre_destroy() retrials and
cgroup removal vetoing and can't be immmediately switched to the new
behavior.  This patch introduces the new behavior of not waiting for
css refs to drain and maintains the old behavior for subsystems which
have __DEPRECATED_clear_css_refs set.
Once, memcg is updated, we can drop the code paths for the old
behavior as proposed in the following patch.  Note that the following
patch is incorrect in that dput work item is in cgroup and may lose
some of dputs when multiples css's are released back-to-back, and
__css_put() triggers check_for_release() when refcnt reaches 0 instead
of 1; however, it shows what part can be removed.
  http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251
Note that, in not-too-distant future, cgroup core will start emitting
warning messages for subsys which require the old behavior, so please
get moving.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
											 
										 
										
											2012-04-01 12:09:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-26 18:40:56 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									BUG_ON ( cgroup_css ( cgrp ,  ss ) ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-07-31 16:16:40 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/* invoke ->css_online() on a new CSS and mark it online if successful */  
						 
					
						
							
								
									
										
										
										
											2013-08-13 11:01:55 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  online_css ( struct  cgroup_subsys_state  * css )  
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:37 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-08-13 11:01:55 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys  * ss  =  css - > ss ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:38 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  ret  =  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:37 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:38 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ss - > css_online ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										ret  =  ss - > css_online ( css ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 20:22:50 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ! ret )  { 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										css - > flags  | =  CSS_ONLINE ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										rcu_assign_pointer ( css - > cgroup - > subsys [ ss - > id ] ,  css ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 20:22:50 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:38 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  ret ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:37 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-07-31 16:16:40 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/* if the CSS is online, invoke ->css_offline() on it and mark it offline */  
						 
					
						
							
								
									
										
										
										
											2013-08-13 11:01:55 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  offline_css ( struct  cgroup_subsys_state  * css )  
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:37 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-08-13 11:01:55 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys  * ss  =  css - > ss ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:37 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ! ( css - > flags  &  CSS_ONLINE ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-03-12 15:35:59 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ss - > css_offline ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										ss - > css_offline ( css ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:37 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									css - > flags  & =  ~ CSS_ONLINE ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									RCU_INIT_POINTER ( css - > cgroup - > subsys [ ss - > id ] ,  NULL ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree.  The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers.  This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups.  Let's assume a hierarchy like the following.
  root - A - B - C
               \ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A.  A's on B.  B's on C and D.  This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent.  In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled.  Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control".  Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups.  In other words, only
leaf csses may contain tasks.  This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two.  Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated.  After
enabling, the whole subtree of a child should point to the new css of
the child.  After disabling, the whole subtree of a child should point
to the cgroup's css.  This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
  controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
  '+' or '-' can be written to "cgroup.subtree_control".  The ones
  prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
  "cgroup.subtree_control" enables it and disabled iff no child's
  "cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
  "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
  some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
  automatically associated with the root cgroup of the default
  hierarchy.  All the controllers which are bound to the default
  hierarchy are listed in the read-only file "cgroup.controllers" in
  the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
  content is equal to that of "cgroup.subtree_control" of the parent.
  This indicates which controllers can be used in the cgroup's
  "cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state.  The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									wake_up_all ( & css - > cgroup - > offline_waitq ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:37 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  create_css  -  create  a  cgroup_subsys_state 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ cgrp :  the  cgroup  new  css  will  be  associated  with 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ ss :  the  subsys  of  new  css 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Create  a  new  css  associated  with  @ cgrp  -  @ ss  pair .   On  success ,  the  new 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  css  is  online  and  installed  in  @ cgrp  with  all  interface  files  created . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Returns  0  on  success ,  - errno  on  failure . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  int  create_css ( struct  cgroup  * cgrp ,  struct  cgroup_subsys  * ss )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * parent  =  cgroup_parent ( cgrp ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:49 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * parent_css  =  cgroup_css ( parent ,  ss ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * css ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									int  err ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:49 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									css  =  ss - > css_alloc ( parent_css ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( IS_ERR ( css ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  PTR_ERR ( css ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									init_and_link_css ( css ,  ss ,  cgrp ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									err  =  percpu_ref_init ( & css - > refcnt ,  css_release ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( err ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-18 17:02:36 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										goto  err_free_css ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									err  =  cgroup_idr_alloc ( & ss - > css_idr ,  NULL ,  2 ,  0 ,  GFP_NOWAIT ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( err  <  0 ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										goto  err_free_percpu_ref ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									css - > id  =  err ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									err  =  cgroup_populate_dir ( cgrp ,  1  < <  ss - > id ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( err ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										goto  err_free_id ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/* @css is ready to be brought online now, make it visible */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:49 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									list_add_tail_rcu ( & css - > sibling ,  & parent_css - > children ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgroup_idr_replace ( & ss - > css_idr ,  css ,  css - > id ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									err  =  online_css ( css ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( err ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:49 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										goto  err_list_del ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ss - > broken_hierarchy  & &  ! ss - > warned_broken_hierarchy  & & 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									    cgroup_parent ( parent ) )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-25 18:28:03 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										pr_warn ( " %s (%d) created nested cgroup for controller  \" %s \"  which has incomplete hierarchy support. Nested cgroups may change behavior in the future. \n " , 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-25 18:28:03 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											current - > comm ,  current - > pid ,  ss - > name ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ! strcmp ( ss - > name ,  " memory " ) ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-25 18:28:03 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											pr_warn ( " \" memory \"  requires setting use_hierarchy to 1 on the root \n " ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										ss - > warned_broken_hierarchy  =  true ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:49 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								err_list_del :  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									list_del_rcu ( & css - > sibling ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Merge branch 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "A lot updates for cgroup:
   - The biggest one is cgroup's conversion to kernfs.  cgroup took
     after the long abandoned vfs-entangled sysfs implementation and
     made it even more convoluted over time.  cgroup's internal objects
     were fused with vfs objects which also brought in vfs locking and
     object lifetime rules.  Naturally, there are places where vfs rules
     don't fit and nasty hacks, such as credential switching or lock
     dance interleaving inode mutex and cgroup_mutex with object serial
     number comparison thrown in to decide whether the operation is
     actually necessary, needed to be employed.
     After conversion to kernfs, internal object lifetime and locking
     rules are mostly isolated from vfs interactions allowing shedding
     of several nasty hacks and overall simplification.  This will also
     allow implmentation of operations which may affect multiple cgroups
     which weren't possible before as it would have required nesting
     i_mutexes.
   - Various simplifications including dropping of module support,
     easier cgroup name/path handling, simplified cgroup file type
     handling and task_cg_lists optimization.
   - Prepatory changes for the planned unified hierarchy, which is still
     a patchset away from being actually operational.  The dummy
     hierarchy is updated to serve as the default unified hierarchy.
     Controllers which aren't claimed by other hierarchies are
     associated with it, which BTW was what the dummy hierarchy was for
     anyway.
   - Various fixes from Li and others.  This pull request includes some
     patches to add missing slab.h to various subsystems.  This was
     triggered xattr.h include removal from cgroup.h.  cgroup.h
     indirectly got included a lot of files which brought in xattr.h
     which brought in slab.h.
  There are several merge commits - one to pull in kernfs updates
  necessary for converting cgroup (already in upstream through
  driver-core), others for interfering changes in the fixes branch"
* 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
  cgroup: remove useless argument from cgroup_exit()
  cgroup: fix spurious lockdep warning in cgroup_exit()
  cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
  cgroup: break kernfs active_ref protection in cgroup directory operations
  cgroup: fix cgroup_taskset walking order
  cgroup: implement CFTYPE_ONLY_ON_DFL
  cgroup: make cgrp_dfl_root mountable
  cgroup: drop const from @buffer of cftype->write_string()
  cgroup: rename cgroup_dummy_root and related names
  cgroup: move ->subsys_mask from cgroupfs_root to cgroup
  cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
  cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
  cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
  cgroup: reorganize cgroup bootstrapping
  cgroup: relocate setting of CGRP_DEAD
  cpuset: use rcu_read_lock() to protect task_cs()
  cgroup_freezer: document freezer_fork() subtleties
  cgroup: update cgroup_transfer_tasks() to either succeed or fail
  cgroup: drop task_lock() protection around task->cgroups
  cgroup: update how a newly forked task gets associated with css_set
  ...
											 
										 
										
											2014-04-03 13:05:42 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgroup_clear_dir ( css - > cgroup ,  1  < <  css - > ss - > id ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								err_free_id :  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									cgroup_idr_remove ( & ss - > css_idr ,  css - > id ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-18 17:02:36 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								err_free_percpu_ref :  
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									percpu_ref_cancel_init ( & css - > refcnt ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-18 17:02:36 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								err_free_css :  
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									call_rcu ( & css - > rcu_head ,  css_free_rcu_fn ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  err ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_mkdir ( struct  kernfs_node  * parent_kn ,  const  char  * name ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											umode_t  mode ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * parent ,  * cgrp ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgroup_root  * root ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
									struct  cgroup_subsys  * ss ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  kernfs_node  * kn ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  ssid ,  ret ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									parent  =  cgroup_kn_lock_live ( parent_kn ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ! parent ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  - ENODEV ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									root  =  parent - > root ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2012-11-19 09:02:12 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* allocate the cgroup and its ID, 0 is reserved for the root */ 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:40:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgrp  =  kzalloc ( sizeof ( * cgrp ) ,  GFP_KERNEL ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ! cgrp )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										ret  =  - ENOMEM ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										goto  out_unlock ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 16:05:46 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									ret  =  percpu_ref_init ( & cgrp - > self . refcnt ,  css_release ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ret ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										goto  out_free_cgrp ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 16:05:46 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Temporarily  set  the  pointer  to  NULL ,  so  idr_find ( )  won ' t  return 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  a  half - baked  cgroup . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgrp - > id  =  cgroup_idr_alloc ( & root - > cgroup_idr ,  NULL ,  2 ,  0 ,  GFP_NOWAIT ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 16:05:46 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( cgrp - > id  <  0 )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										ret  =  - ENOMEM ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										goto  out_cancel_ref ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-05 09:16:59 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2008-10-18 20:28:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									init_cgroup_housekeeping ( cgrp ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:00 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgrp - > self . parent  =  & parent - > self ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgrp - > root  =  root ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2008-03-04 14:28:19 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( notify_on_release ( parent ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										set_bit ( CGRP_NOTIFY_ON_RELEASE ,  & cgrp - > flags ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:38 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( test_bit ( CGRP_CPUSET_CLONE_CHILDREN ,  & parent - > flags ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										set_bit ( CGRP_CPUSET_CLONE_CHILDREN ,  & cgrp - > flags ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2010-10-27 15:33:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* create the directory */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									kn  =  kernfs_create_dir ( parent - > kn ,  name ,  mode ,  cgrp ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( IS_ERR ( kn ) )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										ret  =  PTR_ERR ( kn ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										goto  out_free_id ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									cgrp - > kn  =  kn ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:36 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  This  extra  ref  will  be  put  in  cgroup_free_fn ( )  and  guarantees 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  that  @ cgrp - > kn  is  always  accessible . 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:36 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									kernfs_get ( kn ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:49 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgrp - > self . serial_nr  =  css_serial_nr_next + + ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section.  This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive.  This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.
It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible.  A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling.  IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.
This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next.  This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order.  A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.
This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.
Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.
v2: Typo fix as per Serge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
											 
										 
										
											2013-05-24 10:55:38 +09:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:36 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* allocation complete, commit to creation */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									list_add_tail_rcu ( & cgrp - > self . sibling ,  & cgroup_parent ( cgrp ) - > self . children ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									atomic_inc ( & root - > nr_cgrps ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgroup_get ( parent ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-04-08 14:35:02 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  @ cgrp  is  now  fully  operational .   If  something  fails  after  this 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  point ,  it ' ll  be  released  via  the  normal  destruction  path . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgroup_idr_replace ( & root - > cgroup_idr ,  cgrp ,  cgrp - > id ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-07-31 09:50:50 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									ret  =  cgroup_kn_set_ugid ( kn ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ret ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										goto  out_destroy ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-07 16:44:47 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									ret  =  cgroup_addrm_files ( cgrp ,  cgroup_base_files ,  true ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ret ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										goto  out_destroy ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-28 16:24:11 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* let's create and online css's */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:57 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for_each_subsys ( ss ,  ssid )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( parent - > child_subsys_mask  &  ( 1  < <  ssid ) )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											ret  =  create_css ( cgrp ,  ss ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( ret ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												goto  out_destroy ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:57 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:16 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  On  the  default  hierarchy ,  a  child  doesn ' t  automatically  inherit 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  child_subsys_mask  from  the  parent .   Each  is  configured  manually . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ! cgroup_on_dfl ( cgrp ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										cgrp - > child_subsys_mask  =  parent - > child_subsys_mask ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									kernfs_activate ( kn ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									ret  =  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									goto  out_unlock ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								out_free_id :  
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgroup_idr_remove ( & root - > cgroup_idr ,  cgrp - > id ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								out_cancel_ref :  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									percpu_ref_cancel_init ( & cgrp - > self . refcnt ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								out_free_cgrp :  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:40:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									kfree ( cgrp ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								out_unlock :  
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgroup_kn_unlock ( parent_kn ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  ret ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:38 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								out_destroy :  
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:38 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgroup_destroy_locked ( cgrp ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									goto  out_unlock ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 20:22:50 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  This  is  called  when  the  refcnt  of  a  css  is  confirmed  to  be  killed . 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:01 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  css_tryget_online ( )  is  now  guaranteed  to  fail .   Tell  the  subsystem  to 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  initate  destruction  and  put  the  css  ref  from  kill_css ( ) . 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 20:22:50 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  void  css_killed_work_fn ( struct  work_struct  * work )  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller.  As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability.  For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable.  In its
usage, css refcnting isn't very different from module refcnting.
This patch converts css refcnting to use the recently added
percpu_ref.  css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.
The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s.  This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs.  The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.
This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id().  While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it.  Just drop the sanity check and use
rcu_dereference_raw() instead.
v2: - init_cgroup_css() was calling percpu_ref_init() without checking
      the return value.  This causes two problems - the obvious lack
      of error handling and percpu_ref_init() being called from
      cgroup_init_subsys() before the allocators are up, which
      triggers warnings but doesn't cause actual problems as the
      refcnt isn't used for roots anyway.  Fix both by moving
      percpu_ref_init() to cgroup_create().
    - The base references were put too early by
      percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
      refs one extra time.  This wasn't noticeable because css's go
      through another RCU grace period before being freed.  Update
      cgroup_destroy_locked() to grab an extra reference before
      killing the refcnts.  This problem was noticed by Kent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
											 
										 
										
											2013-06-13 19:39:16 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-08-13 20:22:50 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * css  = 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										container_of ( work ,  struct  cgroup_subsys_state ,  destroy_work ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller.  As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability.  For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable.  In its
usage, css refcnting isn't very different from module refcnting.
This patch converts css refcnting to use the recently added
percpu_ref.  css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.
The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s.  This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs.  The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.
This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id().  While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it.  Just drop the sanity check and use
rcu_dereference_raw() instead.
v2: - init_cgroup_css() was calling percpu_ref_init() without checking
      the return value.  This causes two problems - the obvious lack
      of error handling and percpu_ref_init() being called from
      cgroup_init_subsys() before the allocators are up, which
      triggers warnings but doesn't cause actual problems as the
      refcnt isn't used for roots anyway.  Fix both by moving
      percpu_ref_init() to cgroup_create().
    - The base references were put too early by
      percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
      refs one extra time.  This wasn't noticeable because css's go
      through another RCU grace period before being freed.  Update
      cgroup_destroy_locked() to grab an extra reference before
      killing the refcnts.  This problem was noticed by Kent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
											 
										 
										
											2013-06-13 19:39:16 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 20:22:50 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_lock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 20:22:50 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									offline_css ( css ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 20:22:50 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_unlock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 20:22:50 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									css_put ( css ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller.  As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability.  For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable.  In its
usage, css refcnting isn't very different from module refcnting.
This patch converts css refcnting to use the recently added
percpu_ref.  css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.
The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s.  This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs.  The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.
This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id().  While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it.  Just drop the sanity check and use
rcu_dereference_raw() instead.
v2: - init_cgroup_css() was calling percpu_ref_init() without checking
      the return value.  This causes two problems - the obvious lack
      of error handling and percpu_ref_init() being called from
      cgroup_init_subsys() before the allocators are up, which
      triggers warnings but doesn't cause actual problems as the
      refcnt isn't used for roots anyway.  Fix both by moving
      percpu_ref_init() to cgroup_create().
    - The base references were put too early by
      percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
      refs one extra time.  This wasn't noticeable because css's go
      through another RCU grace period before being freed.  Update
      cgroup_destroy_locked() to grab an extra reference before
      killing the refcnts.  This problem was noticed by Kent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
											 
										 
										
											2013-06-13 19:39:16 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 20:22:50 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/* css kill confirmation processing requires process context, bounce */  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  void  css_killed_ref_fn ( struct  percpu_ref  * ref )  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller.  As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability.  For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable.  In its
usage, css refcnting isn't very different from module refcnting.
This patch converts css refcnting to use the recently added
percpu_ref.  css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.
The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s.  This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs.  The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.
This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id().  While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it.  Just drop the sanity check and use
rcu_dereference_raw() instead.
v2: - init_cgroup_css() was calling percpu_ref_init() without checking
      the return value.  This causes two problems - the obvious lack
      of error handling and percpu_ref_init() being called from
      cgroup_init_subsys() before the allocators are up, which
      triggers warnings but doesn't cause actual problems as the
      refcnt isn't used for roots anyway.  Fix both by moving
      percpu_ref_init() to cgroup_create().
    - The base references were put too early by
      percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
      refs one extra time.  This wasn't noticeable because css's go
      through another RCU grace period before being freed.  Update
      cgroup_destroy_locked() to grab an extra reference before
      killing the refcnts.  This problem was noticed by Kent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
											 
										 
										
											2013-06-13 19:39:16 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * css  = 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										container_of ( ref ,  struct  cgroup_subsys_state ,  refcnt ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 20:22:50 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									INIT_WORK ( & css - > destroy_work ,  css_killed_work_fn ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use a dedicated workqueue for cgroup destruction
Since be44562613851 ("cgroup: remove synchronize_rcu() from
cgroup_diput()"), cgroup destruction path makes use of workqueue.  css
freeing is performed from a work item from that point on and a later
commit, ea15f8ccdb430 ("cgroup: split cgroup destruction into two
steps"), moves css offlining to workqueue too.
As cgroup destruction isn't depended upon for memory reclaim, the
destruction work items were put on the system_wq; unfortunately, some
controller may block in the destruction path for considerable duration
while holding cgroup_mutex.  As large part of destruction path is
synchronized through cgroup_mutex, when combined with high rate of
cgroup removals, this has potential to fill up system_wq's max_active
of 256.
Also, it turns out that memcg's css destruction path ends up queueing
and waiting for work items on system_wq through work_on_cpu().  If
such operation happens while system_wq is fully occupied by cgroup
destruction work items, work_on_cpu() can't make forward progress
because system_wq is full and other destruction work items on
system_wq can't make forward progress because the work item waiting
for work_on_cpu() is holding cgroup_mutex, leading to deadlock.
This can be fixed by queueing destruction work items on a separate
workqueue.  This patch creates a dedicated workqueue -
cgroup_destroy_wq - for this purpose.  As these work items shouldn't
have inter-dependencies and mostly serialized by cgroup_mutex anyway,
giving high concurrency level doesn't buy anything and the workqueue's
@max_active is set to 1 so that destruction work items are executed
one by one on each CPU.
Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
cgroup_destroy_wq can't be allocated from cgroup_init().  Do it from a
separate core_initcall().  In the future, we probably want to reorder
so that workqueue init happens before cgroup_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Shawn Bohrer <shawn.bohrer@gmail.com>
Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
Cc: stable@vger.kernel.org # v3.9+
											 
										 
										
											2013-11-22 17:14:39 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									queue_work ( cgroup_destroy_wq ,  & css - > destroy_work ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller.  As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability.  For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable.  In its
usage, css refcnting isn't very different from module refcnting.
This patch converts css refcnting to use the recently added
percpu_ref.  css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.
The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s.  This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs.  The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.
This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id().  While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it.  Just drop the sanity check and use
rcu_dereference_raw() instead.
v2: - init_cgroup_css() was calling percpu_ref_init() without checking
      the return value.  This causes two problems - the obvious lack
      of error handling and percpu_ref_init() being called from
      cgroup_init_subsys() before the allocators are up, which
      triggers warnings but doesn't cause actual problems as the
      refcnt isn't used for roots anyway.  Fix both by moving
      percpu_ref_init() to cgroup_create().
    - The base references were put too early by
      percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
      refs one extra time.  This wasn't noticeable because css's go
      through another RCU grace period before being freed.  Update
      cgroup_destroy_locked() to grab an extra reference before
      killing the refcnts.  This problem was noticed by Kent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
											 
										 
										
											2013-06-13 19:39:16 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  kill_css  -  destroy  a  css 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ css :  css  to  destroy 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  This  function  initiates  destruction  of  @ css  by  removing  cgroup  interface 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  files  and  putting  its  base  reference .   - > css_offline ( )  will  be  invoked 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:11:01 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  asynchronously  once  css_tryget_online ( )  is  guaranteed  to  fail  and  when 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  the  reference  count  reaches  zero ,  @ css  will  be  released . 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  void  kill_css ( struct  cgroup_subsys_state  * css )  
						 
					
						
							
								
									
										
										
										
											2013-08-13 20:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  This  must  happen  before  css  is  disassociated  with  its  cgroup . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  See  seq_css ( )  for  details . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgroup_clear_dir ( css - > cgroup ,  1  < <  css - > ss - > id ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 20:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 20:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Killing  would  put  the  base  ref ,  but  we  need  to  keep  it  alive 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  until  after  - > css_offline ( ) . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									css_get ( css ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  cgroup  core  guarantees  that ,  by  the  time  - > css_offline ( )  is 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  invoked ,  no  new  css  reference  will  be  given  out  via 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:11:01 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  css_tryget_online ( ) .   We  can ' t  simply  call  percpu_ref_kill ( )  and 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 20:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  proceed  to  offlining  css ' s  because  percpu_ref_kill ( )  doesn ' t 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  guarantee  that  the  ref  is  seen  as  killed  on  all  CPUs  on  return . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Use  percpu_ref_kill_and_confirm ( )  to  get  notifications  as  each 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  css  is  confirmed  to  be  seen  as  killed  on  all  CPUs . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									percpu_ref_kill_and_confirm ( & css - > refcnt ,  css_killed_ref_fn ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller.  As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability.  For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable.  In its
usage, css refcnting isn't very different from module refcnting.
This patch converts css refcnting to use the recently added
percpu_ref.  css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.
The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s.  This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs.  The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.
This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id().  While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it.  Just drop the sanity check and use
rcu_dereference_raw() instead.
v2: - init_cgroup_css() was calling percpu_ref_init() without checking
      the return value.  This causes two problems - the obvious lack
      of error handling and percpu_ref_init() being called from
      cgroup_init_subsys() before the allocators are up, which
      triggers warnings but doesn't cause actual problems as the
      refcnt isn't used for roots anyway.  Fix both by moving
      percpu_ref_init() to cgroup_create().
    - The base references were put too early by
      percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
      refs one extra time.  This wasn't noticeable because css's go
      through another RCU grace period before being freed.  Update
      cgroup_destroy_locked() to grab an extra reference before
      killing the refcnts.  This problem was noticed by Kent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
											 
										 
										
											2013-06-13 19:39:16 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_destroy_locked  -  the  first  stage  of  cgroup  destruction 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ cgrp :  cgroup  to  be  destroyed 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  css ' s  make  use  of  percpu  refcnts  whose  killing  latency  shouldn ' t  be 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  exposed  to  userland  and  are  RCU  protected .   Also ,  cgroup  core  needs  to 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:11:01 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  guarantee  that  css_tryget_online ( )  won ' t  succeed  by  the  time 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  - > css_offline ( )  is  invoked .   To  satisfy  all  the  requirements , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  destruction  is  implemented  in  the  following  two  steps . 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller.  As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability.  For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable.  In its
usage, css refcnting isn't very different from module refcnting.
This patch converts css refcnting to use the recently added
percpu_ref.  css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.
The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s.  This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs.  The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.
This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id().  While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it.  Just drop the sanity check and use
rcu_dereference_raw() instead.
v2: - init_cgroup_css() was calling percpu_ref_init() without checking
      the return value.  This causes two problems - the obvious lack
      of error handling and percpu_ref_init() being called from
      cgroup_init_subsys() before the allocators are up, which
      triggers warnings but doesn't cause actual problems as the
      refcnt isn't used for roots anyway.  Fix both by moving
      percpu_ref_init() to cgroup_create().
    - The base references were put too early by
      percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
      refs one extra time.  This wasn't noticeable because css's go
      through another RCU grace period before being freed.  Update
      cgroup_destroy_locked() to grab an extra reference before
      killing the refcnts.  This problem was noticed by Kent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
											 
										 
										
											2013-06-13 19:39:16 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  s1 .  Verify  @ cgrp  can  be  destroyed  and  mark  it  dying .   Remove  all 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *      userland  visible  parts  and  start  killing  the  percpu  refcnts  of 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *      css ' s .   Set  up  so  that  the  next  stage  will  be  kicked  off  once  all 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *      the  percpu  refcnts  are  confirmed  to  be  killed . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  s2 .  Invoke  - > css_offline ( ) ,  mark  the  cgroup  dead  and  proceed  with  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *      rest  of  destruction .   Once  all  cgroup  references  are  gone ,  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *      cgroup  is  RCU - freed . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  This  function  implements  s1 .   After  this  step ,  @ cgrp  is  gone  as  far  as 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  the  userland  is  concerned  and  a  new  cgroup  with  the  same  name  may  be 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  created .   As  cgroup  doesn ' t  care  about  the  names  internally ,  this 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  doesn ' t  cause  any  problem . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:37 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_destroy_locked ( struct  cgroup  * cgrp )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									__releases ( & cgroup_mutex )  __acquires ( & cgroup_mutex ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * css ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:54 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									bool  empty ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  ssid ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:37 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									lockdep_assert_held ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:54 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  css_set_rwsem  synchronizes  access  to  - > cset_links  and  prevents 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  @ cgrp  from  being  removed  while  put_css_set ( )  is  in  progress . 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:54 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									down_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-28 16:31:23 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									empty  =  list_empty ( & cgrp - > cset_links ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									up_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:54 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ! empty ) 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
										return  - EBUSY ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-02-23 15:24:09 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-28 16:31:23 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  Make  sure  there ' s  no  live  children .   We  can ' t  test  emptiness  of 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  - > self . children  as  dead  children  linger  on  it  while  being 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  drained ;  otherwise ,  " rmdir parent/child parent "  may  fail . 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-28 16:31:23 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:52 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( css_has_online_children ( & cgrp - > self ) ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-28 16:31:23 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  - EBUSY ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-13 19:27:41 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Mark  @ cgrp  dead .   This  prevents  further  task  migration  and  child 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:49 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  creation  by  disabling  cgroup_lock_live_group ( ) . 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-13 19:27:41 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgrp - > self . flags  & =  ~ CSS_ONLINE ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:01 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* initiate massacre of all css's */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:56 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for_each_css ( css ,  ssid ,  cgrp ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										kill_css ( css ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-13 19:27:41 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:51 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* CSS_ONLINE is clear, remove from ->release_list for the last time */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-13 19:27:41 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									raw_spin_lock ( & release_list_lock ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ! list_empty ( & cgrp - > release_list ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										list_del_init ( & cgrp - > release_list ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									raw_spin_unlock ( & release_list_lock ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  Remove  @ cgrp  directory  along  with  the  base  files .   @ cgrp  has  an 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  extra  ref  on  its  kn . 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 20:22:50 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									kernfs_remove ( cgrp - > kn ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 20:22:50 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:48 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									set_bit ( CGRP_RELEASABLE ,  & cgroup_parent ( cgrp ) - > flags ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									check_for_release ( cgroup_parent ( cgrp ) ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:01 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* put the base reference */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									percpu_ref_kill ( & cgrp - > self . refcnt ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-13 19:27:41 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-13 19:27:42 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								} ;  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_rmdir ( struct  kernfs_node  * kn )  
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:37 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * cgrp ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  ret  =  0 ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:37 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgrp  =  cgroup_kn_lock_live ( kn ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ! cgrp ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									cgroup_get ( cgrp ) ; 	/* for @kn->priv clearing */ 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:37 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									ret  =  cgroup_destroy_locked ( cgrp ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgroup_kn_unlock ( kn ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  There  are  two  control  paths  which  try  to  determine  cgroup  from 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  dentry  without  going  through  kernfs  -  cgroupstats_build ( )  and 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  css_tryget_online_from_dir ( ) .   Those  are  supported  by  RCU 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  protecting  clearing  of  cgrp - > kn - > priv  backpointer ,  which  should 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  happen  after  all  files  under  it  have  been  removed . 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ! ret ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										RCU_INIT_POINTER ( * ( void  __rcu  __force  * * ) & kn - > priv ,  NULL ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgroup_put ( cgrp ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:37 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  ret ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  struct  kernfs_syscall_ops  cgroup_kf_syscall_ops  =  {  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									. remount_fs 		=  cgroup_remount , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									. show_options 		=  cgroup_show_options , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									. mkdir 			=  cgroup_mkdir , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									. rmdir 			=  cgroup_rmdir , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									. rename 			=  cgroup_rename , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								} ;  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  __init  cgroup_init_subsys ( struct  cgroup_subsys  * ss ,  bool  early )  
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * css ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-11-14 16:58:54 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									printk ( KERN_INFO  " Initializing cgroup subsys %s \n " ,  ss - > name ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:36 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_lock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									idr_init ( & ss - > css_idr ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:48 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									INIT_LIST_HEAD ( & ss - > cfts ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* Create the root cgroup state for this subsystem */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									ss - > root  =  & cgrp_dfl_root ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									css  =  ss - > css_alloc ( cgroup_css ( & cgrp_dfl_root . cgrp ,  ss ) ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
									/* We don't handle early failures gracefully */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									BUG_ON ( IS_ERR ( css ) ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									init_and_link_css ( css ,  ss ,  & cgrp_dfl_root . cgrp ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:47 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Root  csses  are  never  destroyed  and  we  can ' t  initialize 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  percpu_ref  during  early  init .   Disable  refcnting . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									css - > flags  | =  CSS_NO_REF ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( early )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-14 09:15:02 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										/* allocation can't be done safely during early init */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										css - > id  =  1 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									}  else  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										css - > id  =  cgroup_idr_alloc ( & ss - > css_idr ,  css ,  1 ,  2 ,  GFP_KERNEL ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										BUG_ON ( css - > id  <  0 ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2008-04-29 01:00:13 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* Update the init_css_set to contain a subsys
 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  pointer  to  this  state  -  since  the  subsystem  is 
							 
						 
					
						
							
								
									
										
										
										
											2008-04-29 01:00:13 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  newly  registered ,  all  tasks  and  hence  the 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  init_css_set  is  in  the  subsystem ' s  root  cgroup .  */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									init_css_set . subsys [ ss - > id ]  =  css ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									need_forkexit_callback  | =  ss - > fork  | |  ss - > exit ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2008-04-29 01:00:13 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* At system boot, before all subsystems have been
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  registered ,  no  tasks  have  been  forked ,  so  we  don ' t 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  need  to  invoke  fork  callbacks  here .  */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									BUG_ON ( ! list_empty ( & init_task . tasks ) ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 20:22:50 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									BUG_ON ( online_css ( css ) ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-11-09 09:12:29 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2010-03-10 15:22:09 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_unlock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
									
										
										
										
											2008-02-23 15:24:09 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  cgroup_init_early  -  cgroup  initialization  at  system  boot 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Initialize  cgroups  at  system  boot ,  and  initialize  any 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  subsystems  that  request  early  init . 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								int  __init  cgroup_init_early ( void )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:55 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									static  struct  cgroup_sb_opts  __initdata  opts  = 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										{  . flags  =  CGRP_ROOT_SANE_BEHAVIOR  } ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-25 11:53:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys  * ss ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
									int  i ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-25 11:53:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									init_cgroup_root ( & cgrp_dfl_root ,  & opts ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:47 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgrp_dfl_root . cgrp . self . flags  | =  CSS_NO_REF ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-21 15:52:33 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									RCU_INIT_POINTER ( init_task . cgroups ,  & init_css_set ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for_each_subsys ( ss ,  i )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										WARN ( ! ss - > css_alloc  | |  ! ss - > css_free  | |  ss - > name  | |  ss - > id , 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										     " invalid cgroup_subsys %d:%s css_alloc=%p css_free=%p name:id=%d:%s \n " , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										     i ,  cgroup_subsys_name [ i ] ,  ss - > css_alloc ,  ss - > css_free , 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										     ss - > id ,  ss - > name ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										WARN ( strlen ( cgroup_subsys_name [ i ] )  >  MAX_CGROUP_TYPE_NAMELEN , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										     " cgroup_subsys_name %s too long \n " ,  cgroup_subsys_name [ i ] ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										ss - > id  =  i ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										ss - > name  =  cgroup_subsys_name [ i ] ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ss - > early_init ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											cgroup_init_subsys ( ss ,  true ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
									
										
										
										
											2008-02-23 15:24:09 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  cgroup_init  -  cgroup  initialization 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Register  cgroup  filesystem  and  / proc  file ,  and  initialize 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  any  subsystems  that  didn ' t  request  early  init . 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								int  __init  cgroup_init ( void )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-06-25 11:53:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys  * ss ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-01-10 11:49:27 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									unsigned  long  key ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:53 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  ssid ,  err ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									BUG_ON ( cgroup_init_cftypes ( NULL ,  cgroup_base_files ) ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-04-14 11:36:57 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_lock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-25 11:53:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* Add init_css_set to the hash table */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									key  =  css_set_hash ( init_css_set . subsys ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									hash_add ( css_set_table ,  & init_css_set . hlist ,  key ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									BUG_ON ( cgroup_setup_root ( & cgrp_dfl_root ,  0 ) ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-07-31 09:50:50 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-04-14 11:36:57 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_unlock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:53 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for_each_subsys ( ss ,  ssid )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ss - > early_init )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											struct  cgroup_subsys_state  * css  = 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												init_css_set . subsys [ ss - > id ] ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											css - > id  =  cgroup_idr_alloc ( & ss - > css_idr ,  css ,  1 ,  2 , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
														   GFP_KERNEL ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											BUG_ON ( css - > id  <  0 ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										}  else  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											cgroup_init_subsys ( ss ,  false ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:53 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:15 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										list_add_tail ( & init_css_set . e_cset_node [ ssid ] , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											      & cgrp_dfl_root . cgrp . e_csets [ ssid ] ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:53 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										/*
 
							 
						 
					
						
							
								
									
										
										
										
											2014-06-05 17:16:30 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										 *  Setting  dfl_root  subsys_mask  needs  to  consider  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  disabled  flag  and  cftype  registration  needs  kmalloc , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  both  of  which  aren ' t  available  during  early_init . 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:53 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-06-05 17:16:30 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ! ss - > disabled )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											cgrp_dfl_root . subsys_mask  | =  1  < <  ss - > id ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:53 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											WARN_ON ( cgroup_add_cftypes ( ss ,  ss - > base_cftypes ) ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-06-05 17:16:30 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
										
										
											2010-08-05 13:53:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									cgroup_kobj  =  kobject_create_and_add ( " cgroup " ,  fs_kobj ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ! cgroup_kobj ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  - ENOMEM ; 
							 
						 
					
						
							
								
									
										
										
										
											2010-08-05 13:53:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
									err  =  register_filesystem ( & cgroup_fs_type ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2010-08-05 13:53:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( err  <  0 )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										kobject_put ( cgroup_kobj ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  err ; 
							 
						 
					
						
							
								
									
										
										
										
											2010-08-05 13:53:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2008-04-29 01:00:08 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									proc_create ( " cgroups " ,  0 ,  NULL ,  & proc_cgroupstats_operations ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others.  These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
 conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
 management systems is substantially reduced, since it doesn't need
 to provide process grouping/containment, hence improving their
 chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2007-10-18 23:39:30 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:33 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use a dedicated workqueue for cgroup destruction
Since be44562613851 ("cgroup: remove synchronize_rcu() from
cgroup_diput()"), cgroup destruction path makes use of workqueue.  css
freeing is performed from a work item from that point on and a later
commit, ea15f8ccdb430 ("cgroup: split cgroup destruction into two
steps"), moves css offlining to workqueue too.
As cgroup destruction isn't depended upon for memory reclaim, the
destruction work items were put on the system_wq; unfortunately, some
controller may block in the destruction path for considerable duration
while holding cgroup_mutex.  As large part of destruction path is
synchronized through cgroup_mutex, when combined with high rate of
cgroup removals, this has potential to fill up system_wq's max_active
of 256.
Also, it turns out that memcg's css destruction path ends up queueing
and waiting for work items on system_wq through work_on_cpu().  If
such operation happens while system_wq is fully occupied by cgroup
destruction work items, work_on_cpu() can't make forward progress
because system_wq is full and other destruction work items on
system_wq can't make forward progress because the work item waiting
for work_on_cpu() is holding cgroup_mutex, leading to deadlock.
This can be fixed by queueing destruction work items on a separate
workqueue.  This patch creates a dedicated workqueue -
cgroup_destroy_wq - for this purpose.  As these work items shouldn't
have inter-dependencies and mostly serialized by cgroup_mutex anyway,
giving high concurrency level doesn't buy anything and the workqueue's
@max_active is set to 1 so that destruction work items are executed
one by one on each CPU.
Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
cgroup_destroy_wq can't be allocated from cgroup_init().  Do it from a
separate core_initcall().  In the future, we probably want to reorder
so that workqueue init happens before cgroup_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Shawn Bohrer <shawn.bohrer@gmail.com>
Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
Cc: stable@vger.kernel.org # v3.9+
											 
										 
										
											2013-11-22 17:14:39 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  __init  cgroup_wq_init ( void )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  There  isn ' t  much  point  in  executing  destruction  path  in 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  parallel .   Good  chunk  is  serialized  with  cgroup_mutex  anyway . 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 19:06:19 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  Use  1  for  @ max_active . 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use a dedicated workqueue for cgroup destruction
Since be44562613851 ("cgroup: remove synchronize_rcu() from
cgroup_diput()"), cgroup destruction path makes use of workqueue.  css
freeing is performed from a work item from that point on and a later
commit, ea15f8ccdb430 ("cgroup: split cgroup destruction into two
steps"), moves css offlining to workqueue too.
As cgroup destruction isn't depended upon for memory reclaim, the
destruction work items were put on the system_wq; unfortunately, some
controller may block in the destruction path for considerable duration
while holding cgroup_mutex.  As large part of destruction path is
synchronized through cgroup_mutex, when combined with high rate of
cgroup removals, this has potential to fill up system_wq's max_active
of 256.
Also, it turns out that memcg's css destruction path ends up queueing
and waiting for work items on system_wq through work_on_cpu().  If
such operation happens while system_wq is fully occupied by cgroup
destruction work items, work_on_cpu() can't make forward progress
because system_wq is full and other destruction work items on
system_wq can't make forward progress because the work item waiting
for work_on_cpu() is holding cgroup_mutex, leading to deadlock.
This can be fixed by queueing destruction work items on a separate
workqueue.  This patch creates a dedicated workqueue -
cgroup_destroy_wq - for this purpose.  As these work items shouldn't
have inter-dependencies and mostly serialized by cgroup_mutex anyway,
giving high concurrency level doesn't buy anything and the workqueue's
@max_active is set to 1 so that destruction work items are executed
one by one on each CPU.
Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
cgroup_destroy_wq can't be allocated from cgroup_init().  Do it from a
separate core_initcall().  In the future, we probably want to reorder
so that workqueue init happens before cgroup_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Shawn Bohrer <shawn.bohrer@gmail.com>
Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
Cc: stable@vger.kernel.org # v3.9+
											 
										 
										
											2013-11-22 17:14:39 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  We  would  prefer  to  do  this  in  cgroup_init ( )  above ,  but  that 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  is  called  before  init_workqueues ( ) :  so  leave  this  until  after . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 19:06:19 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cgroup_destroy_wq  =  alloc_workqueue ( " cgroup_destroy " ,  0 ,  1 ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use a dedicated workqueue for cgroup destruction
Since be44562613851 ("cgroup: remove synchronize_rcu() from
cgroup_diput()"), cgroup destruction path makes use of workqueue.  css
freeing is performed from a work item from that point on and a later
commit, ea15f8ccdb430 ("cgroup: split cgroup destruction into two
steps"), moves css offlining to workqueue too.
As cgroup destruction isn't depended upon for memory reclaim, the
destruction work items were put on the system_wq; unfortunately, some
controller may block in the destruction path for considerable duration
while holding cgroup_mutex.  As large part of destruction path is
synchronized through cgroup_mutex, when combined with high rate of
cgroup removals, this has potential to fill up system_wq's max_active
of 256.
Also, it turns out that memcg's css destruction path ends up queueing
and waiting for work items on system_wq through work_on_cpu().  If
such operation happens while system_wq is fully occupied by cgroup
destruction work items, work_on_cpu() can't make forward progress
because system_wq is full and other destruction work items on
system_wq can't make forward progress because the work item waiting
for work_on_cpu() is holding cgroup_mutex, leading to deadlock.
This can be fixed by queueing destruction work items on a separate
workqueue.  This patch creates a dedicated workqueue -
cgroup_destroy_wq - for this purpose.  As these work items shouldn't
have inter-dependencies and mostly serialized by cgroup_mutex anyway,
giving high concurrency level doesn't buy anything and the workqueue's
@max_active is set to 1 so that destruction work items are executed
one by one on each CPU.
Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
cgroup_destroy_wq can't be allocated from cgroup_init().  Do it from a
separate core_initcall().  In the future, we probably want to reorder
so that workqueue init happens before cgroup_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Shawn Bohrer <shawn.bohrer@gmail.com>
Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
Cc: stable@vger.kernel.org # v3.9+
											 
										 
										
											2013-11-22 17:14:39 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									BUG_ON ( ! cgroup_destroy_wq ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-11-29 10:42:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Used  to  destroy  pidlists  and  separate  to  serve  as  flush  domain . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Cap  @ max_active  to  1  too . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									cgroup_pidlist_destroy_wq  =  alloc_workqueue ( " cgroup_pidlist_destroy " , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
														    0 ,  1 ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									BUG_ON ( ! cgroup_pidlist_destroy_wq ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: use a dedicated workqueue for cgroup destruction
Since be44562613851 ("cgroup: remove synchronize_rcu() from
cgroup_diput()"), cgroup destruction path makes use of workqueue.  css
freeing is performed from a work item from that point on and a later
commit, ea15f8ccdb430 ("cgroup: split cgroup destruction into two
steps"), moves css offlining to workqueue too.
As cgroup destruction isn't depended upon for memory reclaim, the
destruction work items were put on the system_wq; unfortunately, some
controller may block in the destruction path for considerable duration
while holding cgroup_mutex.  As large part of destruction path is
synchronized through cgroup_mutex, when combined with high rate of
cgroup removals, this has potential to fill up system_wq's max_active
of 256.
Also, it turns out that memcg's css destruction path ends up queueing
and waiting for work items on system_wq through work_on_cpu().  If
such operation happens while system_wq is fully occupied by cgroup
destruction work items, work_on_cpu() can't make forward progress
because system_wq is full and other destruction work items on
system_wq can't make forward progress because the work item waiting
for work_on_cpu() is holding cgroup_mutex, leading to deadlock.
This can be fixed by queueing destruction work items on a separate
workqueue.  This patch creates a dedicated workqueue -
cgroup_destroy_wq - for this purpose.  As these work items shouldn't
have inter-dependencies and mostly serialized by cgroup_mutex anyway,
giving high concurrency level doesn't buy anything and the workqueue's
@max_active is set to 1 so that destruction work items are executed
one by one on each CPU.
Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
cgroup_destroy_wq can't be allocated from cgroup_init().  Do it from a
separate core_initcall().  In the future, we probably want to reorder
so that workqueue init happens before cgroup_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Shawn Bohrer <shawn.bohrer@gmail.com>
Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
Cc: stable@vger.kernel.org # v3.9+
											 
										 
										
											2013-11-22 17:14:39 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								core_initcall ( cgroup_wq_init ) ;  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  proc_cgroup_show ( ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *   -  Print  task ' s  cgroup  paths  into  seq_file ,  one  line  for  each  hierarchy 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *   -  Used  for  / proc / < pid > / cgroup . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								/* TODO: Use a proper seq_file iterator */  
						 
					
						
							
								
									
										
										
										
											2013-04-19 23:11:24 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								int  proc_cgroup_show ( struct  seq_file  * m ,  void  * v )  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  pid  * pid ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  task_struct  * tsk ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									char  * buf ,  * path ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  retval ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_root  * root ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									retval  =  - ENOMEM ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									buf  =  kmalloc ( PATH_MAX ,  GFP_KERNEL ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ! buf ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										goto  out ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									retval  =  - ESRCH ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									pid  =  m - > private ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									tsk  =  get_pid_task ( pid ,  PIDTYPE_PID ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ! tsk ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										goto  out_free ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									retval  =  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									mutex_lock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									down_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:53 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									for_each_root ( root )  { 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										struct  cgroup_subsys  * ss ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:40:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										struct  cgroup  * cgrp ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:57 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										int  ssid ,  count  =  0 ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:55 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( root  = =  & cgrp_dfl_root  & &  ! cgrp_dfl_root_visible ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:53 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:23 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										seq_printf ( m ,  " %d: " ,  root - > hierarchy_id ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:57 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										for_each_subsys ( ss ,  ssid ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-04-23 11:13:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											if  ( root - > subsys_mask  &  ( 1  < <  ssid ) ) 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-06 15:11:57 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												seq_printf ( m ,  " %s%s " ,  count + +  ?  " , "  :  " " ,  ss - > name ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:19 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( strlen ( root - > name ) ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											seq_printf ( m ,  " %sname=%s " ,  count  ?  " , "  :  " " , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												   root - > name ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										seq_putc ( m ,  ' : ' ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										cgrp  =  task_cgroup_from_root ( tsk ,  root ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										path  =  cgroup_path ( cgrp ,  buf ,  PATH_MAX ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ! path )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											retval  =  - ENAMETOOLONG ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											goto  out_unlock ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										seq_puts ( m ,  path ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										seq_putc ( m ,  ' \n ' ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								out_unlock :  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									up_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_unlock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									put_task_struct ( tsk ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								out_free :  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									kfree ( buf ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								out :  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  retval ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								/* Display information about each subsystem and each hierarchy */  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  int  proc_cgroupstats_show ( struct  seq_file  * m ,  void  * v )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-06-25 11:53:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys  * ss ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  i ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2008-04-04 14:29:57 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									seq_puts ( m ,  " #subsys_name \t hierarchy \t num_cgroups \t enabled \n " ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroups: revamp subsys array
This patch series provides the ability for cgroup subsystems to be
compiled as modules both within and outside the kernel tree.  This is
mainly useful for classifiers and subsystems that hook into components
that are already modules.  cls_cgroup and blkio-cgroup serve as the
example use cases for this feature.
It provides an interface cgroup_load_subsys() and cgroup_unload_subsys()
which modular subsystems can use to register and depart during runtime.
The net_cls classifier subsystem serves as the example for a subsystem
which can be converted into a module using these changes.
Patch #1 sets up the subsys[] array so its contents can be dynamic as
modules appear and (eventually) disappear.  Iterations over the array are
modified to handle when subsystems are absent, and the dynamic section of
the array is protected by cgroup_mutex.
Patch #2 implements an interface for modules to load subsystems, called
cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module
pointer in struct cgroup_subsys.
Patch #3 adds a mechanism for unloading modular subsystems, which includes
a more advanced rework of the rudimentary reference counting introduced in
patch 2.
Patch #4 modifies the net_cls subsystem, which already had some module
declarations, to be configurable as a module, which also serves as a
simple proof-of-concept.
Part of implementing patches 2 and 4 involved updating css pointers in
each css_set when the module appears or leaves.  In doing this, it was
discovered that css_sets always remain linked to the dummy cgroup,
regardless of whether or not any subsystems are actually bound to it
(i.e., not mounted on an actual hierarchy).  The subsystem loading and
unloading code therefore should keep in mind the special cases where the
added subsystem is the only one in the dummy cgroup (and therefore all
css_sets need to be linked back into it) and where the removed subsys was
the only one in the dummy cgroup (and therefore all css_sets should be
unlinked from it) - however, as all css_sets always stay attached to the
dummy cgroup anyway, these cases are ignored.  Any fix that addresses this
issue should also make sure these cases are addressed in the subsystem
loading and unloading code.
This patch:
Make subsys[] able to be dynamically populated to support modular
subsystems
This patch reworks the way the subsys[] array is used so that subsystems
can register themselves after boot time, and enables the internals of
cgroups to be able to handle when subsystems are not present or may
appear/disappear.
Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2010-03-10 15:22:07 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  ideally  we  don ' t  want  subsystems  moving  around  while  we  do  this . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  cgroup_mutex  is  also  necessary  to  guarantee  an  atomic  snapshot  of 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  subsys / hierarchy  state . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_lock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-25 11:53:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									for_each_subsys ( ss ,  i ) 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:23 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										seq_printf ( m ,  " %s \t %d \t %d \t %d \n " , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											   ss - > name ,  ss - > root - > hierarchy_id , 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											   atomic_read ( & ss - > root - > nr_cgrps ) ,  ! ss - > disabled ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-25 11:53:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_unlock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  int  cgroupstats_open ( struct  inode  * inode ,  struct  file  * file )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2008-03-29 03:07:28 +00:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  single_open ( file ,  proc_cgroupstats_show ,  NULL ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-10-01 15:43:56 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  const  struct  file_operations  proc_cgroupstats_operations  =  {  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:35 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									. open  =  cgroupstats_open , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									. read  =  seq_read , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									. llseek  =  seq_lseek , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									. release  =  single_release , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								} ;  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:33 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  cgroup_fork  -  initialize  cgroup  related  fields  during  copy_process ( ) 
							 
						 
					
						
							
								
									
										
										
										
											2008-02-23 15:24:09 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ child :  pointer  to  task_struct  of  forking  parent  process . 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:33 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  A  task  is  associated  with  the  init_css_set  until  cgroup_post_fork ( ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  attaches  it  to  the  parent ' s  css_set .   Empty  cg_list  indicates  that 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ child  isn ' t  holding  reference  to  its  css_set . 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:33 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								void  cgroup_fork ( struct  task_struct  * child )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									RCU_INIT_POINTER ( child - > cgroups ,  & init_css_set ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									INIT_LIST_HEAD ( & child - > cg_list ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:33 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
									
										
										
										
											2008-02-23 15:24:09 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  cgroup_post_fork  -  called  on  a  new  task  after  adding  it  to  the  task  list 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ child :  the  task  in  question 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
										
										
											2012-10-16 15:03:14 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  Adds  the  task  to  the  list  running  through  its  css_set  if  necessary  and 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  call  the  subsystem  fork ( )  callbacks .   Has  to  be  after  the  task  is 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  visible  on  the  task  list  in  case  we  race  with  the  first  call  to 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:26 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  cgroup_task_iter_start ( )  -  to  guarantee  that  the  new  task  ends  up  on  its 
							 
						 
					
						
							
								
									
										
										
										
											2012-10-16 15:03:14 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  list . 
							 
						 
					
						
							
								
									
										
										
										
											2008-02-23 15:24:09 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								void  cgroup_post_fork ( struct  task_struct  * child )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-06-25 11:53:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys  * ss ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-10-16 15:03:14 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  i ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2012-02-08 03:37:27 +01:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  This  may  race  against  cgroup_enable_task_cg_links ( ) .   As  that 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  function  sets  use_task_css_set_links  before  grabbing 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  tasklist_lock  and  we  just  went  through  tasklist_lock  to  add 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  @ child ,  it ' s  guaranteed  that  either  we  see  the  set 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  use_task_css_set_links  or  cgroup_enable_task_cg_lists ( )  sees 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  @ child  during  its  iteration . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  If  we  won  the  race ,  @ child  is  associated  with  % current ' s 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  css_set .   Grabbing  css_set_rwsem  guarantees  both  that  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  association  is  stable ,  and ,  on  completion  of  the  parent ' s 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  migration ,  @ child  is  visible  in  the  source  of  migration  or 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  already  in  the  destination  cgroup .   This  guarantee  is  necessary 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  when  implementing  operations  which  need  to  migrate  all  tasks  of 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  a  cgroup  to  another . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Note  that  if  we  lose  to  cgroup_enable_task_cg_links ( ) ,  @ child 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  will  remain  in  init_css_set .   This  is  safe  because  all  tasks  are 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  in  the  init_css_set  before  cg_links  is  enabled  and  there ' s  no 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  operation  which  transfers  all  tasks  out  of  init_css_set . 
							 
						 
					
						
							
								
									
										
										
										
											2012-02-08 03:37:27 +01:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( use_task_css_set_links )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										struct  css_set  * cset ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										down_write ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										cset  =  task_css_set ( current ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( list_empty ( & child - > cg_list ) )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											rcu_assign_pointer ( child - > cgroups ,  cset ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											list_add ( & child - > cg_list ,  & cset - > tasks ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											get_css_set ( cset ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										up_write ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
										
										
											2012-10-16 15:03:14 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  Call  ss - > fork ( ) .   This  must  happen  after  @ child  is  linked  on 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  css_set ;  otherwise ,  @ child  might  change  state  between  - > fork ( ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  and  addition  to  css_set . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( need_forkexit_callback )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										for_each_subsys ( ss ,  i ) 
							 
						 
					
						
							
								
									
										
										
										
											2012-10-16 15:03:14 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											if  ( ss - > fork ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												ss - > fork ( child ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
									
										
										
										
											2012-10-16 15:03:14 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:33 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  cgroup_exit  -  detach  cgroup  from  exiting  task 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ tsk :  pointer  to  task_struct  of  exiting  process 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Description :  Detach  cgroup  from  @ tsk  and  release  it . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Note  that  cgroups  marked  notify_on_release  force  every  task  in 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  them  to  take  the  global  cgroup_mutex  mutex  when  exiting . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  This  could  impact  scaling  on  very  large  systems .   Be  reluctant  to 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  use  notify_on_release  cgroups  where  very  high  task  exit  scaling 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  is  required  on  large  systems . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  We  set  the  exiting  tasks  cgroup  to  the  root  cgroup  ( top_cgroup ) .   We 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  call  cgroup_exit ( )  while  the  task  is  still  competent  to  handle 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  notify_on_release ( ) ,  then  leave  the  task  attached  to  the  root  cgroup  in 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  each  hierarchy  for  the  remainder  of  its  exit .   No  need  to  bother  with 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  init_css_set  refcnting .   init_css_set  never  goes  away  and  we  can ' t  race 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-28 15:18:27 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  with  migration  path  -  PF_EXITING  is  visible  to  migration  path . 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:33 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-28 15:22:19 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								void  cgroup_exit ( struct  task_struct  * tsk )  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:33 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-06-25 11:53:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys  * ss ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  css_set  * cset ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									bool  put_cset  =  false ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-02-07 17:02:20 +01:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  i ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  Unlink  from  @ tsk  from  its  css_set .   As  migration  path  can ' t  race 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  with  us ,  we  can  check  cg_list  without  grabbing  css_set_rwsem . 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ! list_empty ( & tsk - > cg_list ) )  { 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										down_write ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										list_del_init ( & tsk - > cg_list ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										up_write ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										put_cset  =  true ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:36 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:33 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* Reassign the task to the init_css_set. */ 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-21 15:52:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cset  =  task_css_set ( tsk ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									RCU_INIT_POINTER ( tsk - > cgroups ,  & init_css_set ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-02-07 17:02:20 +01:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-28 15:22:19 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( need_forkexit_callback )  { 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										/* see cgroup_post_fork() for details */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										for_each_subsys ( ss ,  i )  { 
							 
						 
					
						
							
								
									
										
										
										
											2011-02-07 17:02:20 +01:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											if  ( ss - > exit )  { 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												struct  cgroup_subsys_state  * old_css  =  cset - > subsys [ i ] ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												struct  cgroup_subsys_state  * css  =  task_css ( tsk ,  i ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-25 11:53:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
												ss - > exit ( css ,  old_css ,  tsk ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-02-07 17:02:20 +01:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-25 10:04:03 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( put_cset ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										put_css_set ( cset ,  true ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:33 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:34 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:40:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  check_for_release ( struct  cgroup  * cgrp )  
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-05-16 13:22:52 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( cgroup_is_releasable ( cgrp )  & &  list_empty ( & cgrp - > cset_links )  & & 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									    ! css_has_online_children ( & cgrp - > self ) )  { 
							 
						 
					
						
							
								
									
										
										
										
											2013-03-01 15:06:07 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  Control  Group  is  currently  removeable .  If  it ' s  not 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										 *  already  queued  for  a  userspace  notification ,  queue 
							 
						 
					
						
							
								
									
										
										
										
											2013-03-01 15:06:07 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										 *  it  now 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 */ 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										int  need_schedule_work  =  0 ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-03-01 15:06:07 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-07-25 16:47:45 +02:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										raw_spin_lock ( & release_list_lock ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:53 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ! cgroup_is_dead ( cgrp )  & & 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:40:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										    list_empty ( & cgrp - > release_list ) )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											list_add ( & cgrp - > release_list ,  & release_list ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											need_schedule_work  =  1 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
										
										
											2009-07-25 16:47:45 +02:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										raw_spin_unlock ( & release_list_lock ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( need_schedule_work ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											schedule_work ( & release_agent_work ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								/*
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Notify  userspace  when  a  cgroup  is  released ,  by  running  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  configured  release  agent  with  the  name  of  the  cgroup  ( path 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  relative  to  the  root  of  cgroup  file  system )  as  the  argument . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Most  likely ,  this  user  command  will  try  to  rmdir  this  cgroup . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  This  races  with  the  possibility  that  some  other  task  will  be 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  attached  to  this  cgroup  before  it  is  removed ,  or  that  some  other 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  user  task  will  ' mkdir '  a  child  cgroup  of  this  cgroup .   That ' s  ok . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  The  presumed  ' rmdir '  will  fail  quietly  if  this  cgroup  is  no  longer 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  unused ,  and  this  cgroup  will  be  reprieved  from  its  death  sentence , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  to  continue  to  serve  a  useful  existence .   Next  time  it ' s  released , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  we  will  get  notified  again ,  if  it  still  has  ' notify_on_release '  set . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  The  final  arg  to  call_usermodehelper ( )  is  UMH_WAIT_EXEC ,  which 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  means  only  wait  until  the  task  is  successfully  execve ( ) ' d .   The 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  separate  release  agent  task  is  forked  by  call_usermodehelper ( ) , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  then  control  in  this  thread  returns  here ,  without  waiting  for  the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  release  agent  task .   We  don ' t  bother  to  wait  because  the  caller  of 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  this  routine  has  no  use  for  the  exit  status  of  the  release  agent 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  task ,  so  no  sense  holding  our  caller  up  for  that . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  void  cgroup_release_agent ( struct  work_struct  * work )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									BUG_ON ( work  ! =  & release_agent_work ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									mutex_lock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-07-25 16:47:45 +02:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									raw_spin_lock ( & release_list_lock ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									while  ( ! list_empty ( & release_list ) )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										char  * argv [ 3 ] ,  * envp [ 3 ] ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										int  i ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										char  * pathbuf  =  NULL ,  * agentbuf  =  NULL ,  * path ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:40:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										struct  cgroup  * cgrp  =  list_entry ( release_list . next , 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
														    struct  cgroup , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
														    release_list ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:40:44 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										list_del_init ( & cgrp - > release_list ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-07-25 16:47:45 +02:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										raw_spin_unlock ( & release_list_lock ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										pathbuf  =  kmalloc ( PATH_MAX ,  GFP_KERNEL ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-07-25 01:46:59 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										if  ( ! pathbuf ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											goto  continue_free ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										path  =  cgroup_path ( cgrp ,  pathbuf ,  PATH_MAX ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ! path ) 
							 
						 
					
						
							
								
									
										
										
										
											2008-07-25 01:46:59 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											goto  continue_free ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										agentbuf  =  kstrdup ( cgrp - > root - > release_agent_path ,  GFP_KERNEL ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ! agentbuf ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											goto  continue_free ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										i  =  0 ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-07-25 01:46:59 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										argv [ i + + ]  =  agentbuf ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										argv [ i + + ]  =  path ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										argv [ i ]  =  NULL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										i  =  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										/* minimal command environment */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										envp [ i + + ]  =  " HOME=/ " ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										envp [ i + + ]  =  " PATH=/sbin:/bin:/usr/sbin:/usr/bin " ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										envp [ i ]  =  NULL ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										/* Drop the lock while we invoke the usermode helper,
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  since  the  exec  could  involve  hitting  disk  and  hence 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										 *  be  a  slow  process  */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										mutex_unlock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										call_usermodehelper ( argv [ 0 ] ,  argv ,  envp ,  UMH_WAIT_EXEC ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										mutex_lock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-07-25 01:46:59 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 continue_free : 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										kfree ( pathbuf ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										kfree ( agentbuf ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-07-25 16:47:45 +02:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										raw_spin_lock ( & release_list_lock ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
										
										
											2009-07-25 16:47:45 +02:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									raw_spin_unlock ( & release_list_lock ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2007-10-18 23:39:38 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									mutex_unlock ( & cgroup_mutex ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
									
										
										
										
											2008-04-04 14:29:57 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  int  __init  cgroup_disable ( char  * str )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-06-25 11:53:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys  * ss ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-04-04 14:29:57 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									char  * token ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-25 11:53:37 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									int  i ; 
							 
						 
					
						
							
								
									
										
										
										
											2008-04-04 14:29:57 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									while  ( ( token  =  strsep ( & str ,  " , " ) )  ! =  NULL )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										if  ( ! * token ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											continue ; 
							 
						 
					
						
							
								
									
										
										
										
											2012-09-13 09:50:55 +02:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										for_each_subsys ( ss ,  i )  { 
							 
						 
					
						
							
								
									
										
										
										
											2008-04-04 14:29:57 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											if  ( ! strcmp ( token ,  ss - > name ) )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												ss - > disabled  =  1 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												printk ( KERN_INFO  " Disabling %s control group " 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
													"  subsystem \n " ,  ss - > name ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												break ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  1 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								__setup ( " cgroup_disable= " ,  cgroup_disable ) ;  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: CSS ID support
Patch for Per-CSS(Cgroup Subsys State) ID and private hierarchy code.
This patch attaches unique ID to each css and provides following.
 - css_lookup(subsys, id)
   returns pointer to struct cgroup_subysys_state of id.
 - css_get_next(subsys, id, rootid, depth, foundid)
   returns the next css under "root" by scanning
When cgroup_subsys->use_id is set, an id for css is maintained.
The cgroup framework only parepares
	- css_id of root css for subsys
	- id is automatically attached at creation of css.
	- id is *not* freed automatically. Because the cgroup framework
	  don't know lifetime of cgroup_subsys_state.
	  free_css_id() function is provided. This must be called by subsys.
There are several reasons to develop this.
	- Saving space .... For example, memcg's swap_cgroup is array of
	  pointers to cgroup. But it is not necessary to be very fast.
	  By replacing pointers(8bytes per ent) to ID (2byes per ent), we can
	  reduce much amount of memory usage.
	- Scanning without lock.
	  CSS_ID provides "scan id under this ROOT" function. By this, scanning
	  css under root can be written without locks.
	  ex)
	  do {
		rcu_read_lock();
		next = cgroup_get_next(subsys, id, root, &found);
		/* check sanity of next here */
		css_tryget();
		rcu_read_unlock();
		id = found + 1
	 } while(...)
Characteristics:
	- Each css has unique ID under subsys.
	- Lifetime of ID is controlled by subsys.
	- css ID contains "ID" and "Depth in hierarchy" and stack of hierarchy
	- Allowed ID is 1-65535, ID 0 is UNUSED ID.
Design Choices:
	- scan-by-ID v.s. scan-by-tree-walk.
	  As /proc's pid scan does, scan-by-ID is robust when scanning is done
	  by following kind of routine.
	  scan -> rest a while(release a lock) -> conitunue from interrupted
	  memcg's hierarchical reclaim does this.
	- When subsys->use_id is set, # of css in the system is limited to
	  65535.
[bharata@linux.vnet.ibm.com: remove rcu_read_lock() from css_get_next()]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											 
										 
										
											2009-04-02 16:57:25 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 11:01:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:11:01 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  css_tryget_online_from_dir  -  get  corresponding  css  from  a  cgroup  dentry 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-26 18:40:56 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  @ dentry :  directory  dentry  of  interest 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ ss :  subsystem  of  interest 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-13 11:01:54 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:47 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 *  If  @ dentry  is  a  directory  for  a  cgroup  which  has  @ ss  enabled  on  it ,  try 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  to  get  the  corresponding  css  and  return  it .   If  such  css  doesn ' t  exist 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  or  can ' t  be  pinned ,  an  ERR_PTR  value  is  returned . 
							 
						 
					
						
							
								
									
										
										
										
											2011-02-14 11:20:01 +02:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:11:01 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								struct  cgroup_subsys_state  * css_tryget_online_from_dir ( struct  dentry  * dentry ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
														       struct  cgroup_subsys  * ss ) 
							 
						 
					
						
							
								
									
										
										
										
											2011-02-14 11:20:01 +02:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  kernfs_node  * kn  =  kernfs_node_from_dentry ( dentry ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * css  =  NULL ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-02-14 11:20:01 +02:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup  * cgrp ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-26 18:40:56 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/* is @dentry a cgroup dir? */ 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( dentry - > d_sb - > s_type  ! =  & cgroup_fs_type  | |  ! kn  | | 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									    kernfs_type ( kn )  ! =  KERNFS_DIR ) 
							 
						 
					
						
							
								
									
										
										
										
											2011-02-14 11:20:01 +02:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										return  ERR_PTR ( - EBADF ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:47 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									rcu_read_lock ( ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									/*
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  This  path  doesn ' t  originate  from  kernfs  and  @ kn  could  already 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									 *  have  been  or  be  removed  at  any  point .   @ kn - > priv  is  RCU 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:19:22 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 *  protected  for  this  access .   See  cgroup_rmdir ( )  for  details . 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.
* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.
* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.
* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.
      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.
    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.
    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
											 
										 
										
											2014-02-11 11:52:49 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									cgrp  =  rcu_dereference ( kn - > priv ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( cgrp ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										css  =  cgroup_css ( cgrp ,  ss ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:47 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-05-13 12:11:01 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									if  ( ! css  | |  ! css_tryget_online ( css ) ) 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-11 11:52:47 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										css  =  ERR_PTR ( - ENOENT ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									rcu_read_unlock ( ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  css ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-02-14 11:20:01 +02:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-19 10:05:24 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								/**
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  css_from_id  -  lookup  css  by  id 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ id :  the  cgroup  id 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  @ ss :  cgroup  subsys  to  be  looked  into 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 * 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Returns  the  css  if  there ' s  valid  one  with  @ id ,  otherwise  returns  NULL . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 *  Should  be  called  under  rcu_read_lock ( ) . 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								 */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								struct  cgroup_subsys_state  * css_from_id ( int  id ,  struct  cgroup_subsys  * ss )  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:13 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									WARN_ON_ONCE ( ! rcu_read_lock_held ( ) ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-05-04 15:09:14 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  idr_find ( & ss - > css_idr ,  id ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2011-02-14 11:20:01 +02:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								# ifdef CONFIG_CGROUP_DEBUG 
  
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  struct  cgroup_subsys_state  *  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								debug_css_alloc ( struct  cgroup_subsys_state  * parent_css )  
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * css  =  kzalloc ( sizeof ( * css ) ,  GFP_KERNEL ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ! css ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  ERR_PTR ( - ENOMEM ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  css ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  void  debug_css_free ( struct  cgroup_subsys_state  * css )  
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:23 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									kfree ( css ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:24 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  u64  debug_taskcount_read ( struct  cgroup_subsys_state  * css ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												struct  cftype  * cft ) 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:24 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  cgroup_task_count ( css - > cgroup ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:24 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  u64  current_css_set_read ( struct  cgroup_subsys_state  * css ,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												struct  cftype  * cft ) 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  ( u64 ) ( unsigned  long ) current - > cgroups ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:24 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  u64  current_css_set_refcount_read ( struct  cgroup_subsys_state  * css ,  
						 
					
						
							
								
									
										
										
										
											2013-06-14 11:17:19 +08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
													 struct  cftype  * cft ) 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									u64  count ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									rcu_read_lock ( ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-21 15:52:04 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									count  =  atomic_read ( & task_css_set ( current ) - > refcount ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									rcu_read_unlock ( ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									return  count ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  current_css_set_cg_links_read ( struct  seq_file  * seq ,  void  * v )  
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgrp_cset_link  * link ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  css_set  * cset ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									char  * name_buf ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									name_buf  =  kmalloc ( NAME_MAX  +  1 ,  GFP_KERNEL ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									if  ( ! name_buf ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										return  - ENOMEM ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									down_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									rcu_read_lock ( ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									cset  =  rcu_dereference ( current - > cgroups ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									list_for_each_entry ( link ,  & cset - > cgrp_links ,  cgrp_link )  { 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										struct  cgroup  * c  =  link - > cgrp ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:55 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										cgroup_name ( c ,  name_buf ,  NAME_MAX  +  1 ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:23 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										seq_printf ( seq ,  " Root %d group %s \n " , 
							 
						 
					
						
							
								
									
										
										
										
											2014-03-19 10:23:55 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											   c - > root - > hierarchy_id ,  name_buf ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									rcu_read_unlock ( ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									up_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2014-02-12 09:29:50 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									kfree ( name_buf ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# define MAX_TASKS_SHOWN_PER_CSS 25 
  
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  int  cgroup_css_links_read ( struct  seq_file  * seq ,  void  * v )  
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgroup_subsys_state  * css  =  seq_css ( seq ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									struct  cgrp_cset_link  * link ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									down_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:24 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									list_for_each_entry ( link ,  & css - > cgroup - > cset_links ,  cset_link )  { 
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:50 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										struct  css_set  * cset  =  link - > cset ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										struct  task_struct  * task ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										int  count  =  0 ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add css_set->mg_tasks
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, we're going to use task->cg_list during
migration too.  Instead of building a separate array, target tasks
will be linked into a dedicated migration list_head on the owning
css_set.  Tasks on the migration list are treated the same as tasks on
the usual tasks list; however, being on a separate list allows cgroup
migration code path to keep track of the target tasks by simply
keeping the list of css_sets with tasks being migrated, making
unpredictable dynamic allocation unnecessary.
In prepartion of such migration path update, this patch introduces
css_set->mg_tasks list and updates css_set task iterations so that
they walk both css_set->tasks and ->mg_tasks.  Note that ->mg_tasks
isn't used yet.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										seq_printf ( seq ,  " css_set %p \n " ,  cset ) ; 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add css_set->mg_tasks
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, we're going to use task->cg_list during
migration too.  Instead of building a separate array, target tasks
will be linked into a dedicated migration list_head on the owning
css_set.  Tasks on the migration list are treated the same as tasks on
the usual tasks list; however, being on a separate list allows cgroup
migration code path to keep track of the target tasks by simply
keeping the list of css_sets with tasks being migrated, making
unpredictable dynamic allocation unnecessary.
In prepartion of such migration path update, this patch introduces
css_set->mg_tasks list and updates css_set task iterations so that
they walk both css_set->tasks and ->mg_tasks.  Note that ->mg_tasks
isn't used yet.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-06-12 21:04:49 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										list_for_each_entry ( task ,  & cset - > tasks ,  cg_list )  { 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add css_set->mg_tasks
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, we're going to use task->cg_list during
migration too.  Instead of building a separate array, target tasks
will be linked into a dedicated migration list_head on the owning
css_set.  Tasks on the migration list are treated the same as tasks on
the usual tasks list; however, being on a separate list allows cgroup
migration code path to keep track of the target tasks by simply
keeping the list of css_sets with tasks being migrated, making
unpredictable dynamic allocation unnecessary.
In prepartion of such migration path update, this patch introduces
css_set->mg_tasks list and updates css_set task iterations so that
they walk both css_set->tasks and ->mg_tasks.  Note that ->mg_tasks
isn't used yet.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
											if  ( count + +  >  MAX_TASKS_SHOWN_PER_CSS ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												goto  overflow ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											seq_printf ( seq ,  "   task %d \n " ,  task_pid_vnr ( task ) ) ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										list_for_each_entry ( task ,  & cset - > mg_tasks ,  cg_list )  { 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											if  ( count + +  >  MAX_TASKS_SHOWN_PER_CSS ) 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
												goto  overflow ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
											seq_printf ( seq ,  "   task %d \n " ,  task_pid_vnr ( task ) ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										} 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: add css_set->mg_tasks
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit.  On 64bit, struct task_and_cgroup is
  24bytes making the flex element limit around 87k.  It is a high
  number but not impossible to hit.  This means that the current
  cgroup implementation can't migrate a process with more than 87k
  threads.
* Process migration involves memory allocation whose size is dependent
  on the number of threads the process has.  This means that cgroup
  core can't guarantee success or failure of multi-process migrations
  as memory allocation failure can happen in the middle.  This is in
  part because cgroup can't grab threadgroup locks of multiple
  processes at the same time, so when there are multiple processes to
  migrate, it is imposible to tell how many tasks are to be migrated
  beforehand.
  Note that this already affects cgroup_transfer_tasks().  cgroup
  currently cannot guarantee atomic success or failure of the
  operation.  It may fail in the middle and after such failure cgroup
  doesn't have enough information to roll back properly.  It just
  aborts with some tasks migrated and others not.
To resolve the situation, we're going to use task->cg_list during
migration too.  Instead of building a separate array, target tasks
will be linked into a dedicated migration list_head on the owning
css_set.  Tasks on the migration list are treated the same as tasks on
the usual tasks list; however, being on a separate list allows cgroup
migration code path to keep track of the target tasks by simply
keeping the list of css_sets with tasks being migrated, making
unpredictable dynamic allocation unnecessary.
In prepartion of such migration path update, this patch introduces
css_set->mg_tasks list and updates css_set task iterations so that
they walk both css_set->tasks and ->mg_tasks.  Note that ->mg_tasks
isn't used yet.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-25 10:04:01 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										continue ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									overflow : 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										seq_puts ( seq ,  "   ... \n " ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} 
							 
						 
					
						
							
								
									
										
											 
										
											
												cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks().  The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end.  It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users().  As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
											 
										 
										
											2014-02-13 06:58:40 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									up_read ( & css_set_rwsem ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  0 ; 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:24 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								static  u64  releasable_read ( struct  cgroup_subsys_state  * css ,  struct  cftype  * cft )  
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								{  
						 
					
						
							
								
									
										
										
										
											2013-08-08 20:11:24 -04:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									return  test_bit ( CGRP_RELEASABLE ,  & css - > cgroup - > flags ) ; 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								}  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								static  struct  cftype  debug_files [ ]  =   {  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									{ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. name  =  " taskcount " , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. read_u64  =  debug_taskcount_read , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									{ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. name  =  " current_css_set " , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. read_u64  =  current_css_set_read , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									{ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. name  =  " current_css_set_refcount " , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. read_u64  =  current_css_set_refcount_read , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									{ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. name  =  " current_css_set_cg_links " , 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										. seq_show  =  current_css_set_cg_links_read , 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									{ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. name  =  " cgroup_css_links " , 
							 
						 
					
						
							
								
									
										
										
										
											2013-12-05 12:28:04 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
										. seq_show  =  cgroup_css_links_read , 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:22 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									} , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									{ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. name  =  " releasable " , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
										. read_u64  =  releasable_read , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									} , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									{  } 	/* terminate */ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								} ;  
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2014-02-08 10:36:58 -05:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								struct  cgroup_subsys  debug_cgrp_subsys  =  {  
						 
					
						
							
								
									
										
										
										
											2012-11-19 08:13:38 -08:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									. css_alloc  =  debug_css_alloc , 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
									. css_free  =  debug_css_free , 
							 
						 
					
						
							
								
									
										
										
										
											2012-04-01 12:09:55 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
									. base_cftypes  =  debug_files , 
							 
						 
					
						
							
								
									
										
										
										
											2009-09-23 15:56:20 -07:00 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								} ;  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								# endif  /* CONFIG_CGROUP_DEBUG */