| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | #ifndef _LINUX_PRCTL_H
 | 
					
						
							|  |  |  | #define _LINUX_PRCTL_H
 | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation
During development of c/r we've noticed that in case if we need to support
user namespaces we face a problem with capabilities in prctl(PR_SET_MM,
...) call, in particular once new user namespace is created
capable(CAP_SYS_RESOURCE) no longer passes.
A approach is to eliminate CAP_SYS_RESOURCE check but pass all new values
in one bundle, which would allow the kernel to make more intensive test
for sanity of values and same time allow us to support checkpoint/restore
of user namespaces.
Thus a new command PR_SET_MM_MAP introduced. It takes a pointer of
prctl_mm_map structure which carries all the members to be updated.
	prctl(PR_SET_MM, PR_SET_MM_MAP, struct prctl_mm_map *, size)
	struct prctl_mm_map {
		__u64	start_code;
		__u64	end_code;
		__u64	start_data;
		__u64	end_data;
		__u64	start_brk;
		__u64	brk;
		__u64	start_stack;
		__u64	arg_start;
		__u64	arg_end;
		__u64	env_start;
		__u64	env_end;
		__u64	*auxv;
		__u32	auxv_size;
		__u32	exe_fd;
	};
All members except @exe_fd correspond ones of struct mm_struct.  To figure
out which available values these members may take here are meanings of the
members.
 - start_code, end_code: represent bounds of executable code area
 - start_data, end_data: represent bounds of data area
 - start_brk, brk: used to calculate bounds for brk() syscall
 - start_stack: used when accounting space needed for command
   line arguments, environment and shmat() syscall
 - arg_start, arg_end, env_start, env_end: represent memory area
   supplied for command line arguments and environment variables
 - auxv, auxv_size: carries auxiliary vector, Elf format specifics
 - exe_fd: file descriptor number for executable link (/proc/self/exe)
Thus we apply the following requirements to the values
1) Any member except @auxv, @auxv_size, @exe_fd is rather an address
   in user space thus it must be laying inside [mmap_min_addr, mmap_max_addr)
   interval.
2) While @[start|end]_code and @[start|end]_data may point to an nonexisting
   VMAs (say a program maps own new .text and .data segments during execution)
   the rest of members should belong to VMA which must exist.
3) Addresses must be ordered, ie @start_ member must not be greater or
   equal to appropriate @end_ member.
4) As in regular Elf loading procedure we require that @start_brk and
   @brk be greater than @end_data.
5) If RLIMIT_DATA rlimit is set to non-infinity new values should not
   exceed existing limit. Same applies to RLIMIT_STACK.
6) Auxiliary vector size must not exceed existing one (which is
   predefined as AT_VECTOR_SIZE and depends on architecture).
7) File descriptor passed in @exe_file should be pointing
   to executable file (because we use existing prctl_set_mm_exe_file_locked
   helper it ensures that the file we are going to use as exe link has all
   required permission granted).
Now about where these members are involved inside kernel code:
 - @start_code and @end_code are used in /proc/$pid/[stat|statm] output;
 - @start_data and @end_data are used in /proc/$pid/[stat|statm] output,
   also they are considered if there enough space for brk() syscall
   result if RLIMIT_DATA is set;
 - @start_brk shown in /proc/$pid/stat output and accounted in brk()
   syscall if RLIMIT_DATA is set; also this member is tested to
   find a symbolic name of mmap event for perf system (we choose
   if event is generated for "heap" area); one more aplication is
   selinux -- we test if a process has PROCESS__EXECHEAP permission
   if trying to make heap area being executable with mprotect() syscall;
 - @brk is a current value for brk() syscall which lays inside heap
   area, it's shown in /proc/$pid/stat. When syscall brk() succesfully
   provides new memory area to a user space upon brk() completion the
   mm::brk is updated to carry new value;
   Both @start_brk and @brk are actively used in /proc/$pid/maps
   and /proc/$pid/smaps output to find a symbolic name "heap" for
   VMA being scanned;
 - @start_stack is printed out in /proc/$pid/stat and used to
   find a symbolic name "stack" for task and threads in
   /proc/$pid/maps and /proc/$pid/smaps output, and as the same
   as with @start_brk -- perf system uses it for event naming.
   Also kernel treat this member as a start address of where
   to map vDSO pages and to check if there is enough space
   for shmat() syscall;
 - @arg_start, @arg_end, @env_start and @env_end are printed out
   in /proc/$pid/stat. Another access to the data these members
   represent is to read /proc/$pid/environ or /proc/$pid/cmdline.
   Any attempt to read these areas kernel tests with access_process_vm
   helper so a user must have enough rights for this action;
 - @auxv and @auxv_size may be read from /proc/$pid/auxv. Strictly
   speaking kernel doesn't care much about which exactly data is
   sitting there because it is solely for userspace;
 - @exe_fd is referred from /proc/$pid/exe and when generating
   coredump. We uses prctl_set_mm_exe_file_locked helper to update
   this member, so exe-file link modification remains one-shot
   action.
Still note that updating exe-file link now doesn't require sys-resource
capability anymore, after all there is no much profit in preventing setup
own file link (there are a number of ways to execute own code -- ptrace,
ld-preload, so that the only reliable way to find which exactly code is
executed is to inspect running program memory).  Still we require the
caller to be at least user-namespace root user.
I believe the old interface should be deprecated and ripped off in a
couple of kernel releases if no one against.
To test if new interface is implemented in the kernel one can pass
PR_SET_MM_MAP_SIZE opcode and the kernel returns the size of currently
supported struct prctl_mm_map.
[akpm@linux-foundation.org: fix 80-col wordwrap in macro definitions]
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Andrew Vagin <avagin@openvz.org>
Tested-by: Andrew Vagin <avagin@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Julien Tinnes <jln@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2014-10-09 15:27:37 -07:00
										 |  |  | #include <linux/types.h>
 | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | /* Values to pass as first argument to prctl() */ | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | #define PR_SET_PDEATHSIG  1  /* Second arg is a signal */
 | 
					
						
							|  |  |  | #define PR_GET_PDEATHSIG  2  /* Second arg is a ptr to return the signal */
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | /* Get/set current->mm->dumpable */ | 
					
						
							|  |  |  | #define PR_GET_DUMPABLE   3
 | 
					
						
							|  |  |  | #define PR_SET_DUMPABLE   4
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | /* Get/set unaligned access control bits (if meaningful) */ | 
					
						
							|  |  |  | #define PR_GET_UNALIGN	  5
 | 
					
						
							|  |  |  | #define PR_SET_UNALIGN	  6
 | 
					
						
							|  |  |  | # define PR_UNALIGN_NOPRINT	1	/* silently fix up unaligned user accesses */
 | 
					
						
							|  |  |  | # define PR_UNALIGN_SIGBUS	2	/* generate SIGBUS on unaligned user access */
 | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												capabilities: implement per-process securebits
Filesystem capability support makes it possible to do away with (set)uid-0
based privilege and use capabilities instead.  That is, with filesystem
support for capabilities but without this present patch, it is (conceptually)
possible to manage a system with capabilities alone and never need to obtain
privilege via (set)uid-0.
Of course, conceptually isn't quite the same as currently possible since few
user applications, certainly not enough to run a viable system, are currently
prepared to leverage capabilities to exercise privilege.  Further, many
applications exist that may never get upgraded in this way, and the kernel
will continue to want to support their setuid-0 base privilege needs.
Where pure-capability applications evolve and replace setuid-0 binaries, it is
desirable that there be a mechanisms by which they can contain their
privilege.  In addition to leveraging the per-process bounding and inheritable
sets, this should include suppressing the privilege of the uid-0 superuser
from the process' tree of children.
The feature added by this patch can be leveraged to suppress the privilege
associated with (set)uid-0.  This suppression requires CAP_SETPCAP to
initiate, and only immediately affects the 'current' process (it is inherited
through fork()/exec()).  This reimplementation differs significantly from the
historical support for securebits which was system-wide, unwieldy and which
has ultimately withered to a dead relic in the source of the modern kernel.
With this patch applied a process, that is capable(CAP_SETPCAP), can now drop
all legacy privilege (through uid=0) for itself and all subsequently
fork()'d/exec()'d children with:
  prctl(PR_SET_SECUREBITS, 0x2f);
This patch represents a no-op unless CONFIG_SECURITY_FILE_CAPABILITIES is
enabled at configure time.
[akpm@linux-foundation.org: fix uninitialised var warning]
[serue@us.ibm.com: capabilities: use cap_task_prctl when !CONFIG_SECURITY]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Paul Moore <paul.moore@hp.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-04-28 02:13:40 -07:00
										 |  |  | /* Get/set whether or not to drop capabilities on setuid() away from
 | 
					
						
							|  |  |  |  * uid 0 (as per security/commoncap.c) */ | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | #define PR_GET_KEEPCAPS   7
 | 
					
						
							|  |  |  | #define PR_SET_KEEPCAPS   8
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | /* Get/set floating-point emulation control bits (if meaningful) */ | 
					
						
							|  |  |  | #define PR_GET_FPEMU  9
 | 
					
						
							|  |  |  | #define PR_SET_FPEMU 10
 | 
					
						
							|  |  |  | # define PR_FPEMU_NOPRINT	1	/* silently emulate fp operations accesses */
 | 
					
						
							|  |  |  | # define PR_FPEMU_SIGFPE	2	/* don't emulate fp operations, send SIGFPE instead */
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | /* Get/set floating-point exception mode (if meaningful) */ | 
					
						
							|  |  |  | #define PR_GET_FPEXC	11
 | 
					
						
							|  |  |  | #define PR_SET_FPEXC	12
 | 
					
						
							|  |  |  | # define PR_FP_EXC_SW_ENABLE	0x80	/* Use FPEXC for FP exception enables */
 | 
					
						
							|  |  |  | # define PR_FP_EXC_DIV		0x010000	/* floating point divide by zero */
 | 
					
						
							|  |  |  | # define PR_FP_EXC_OVF		0x020000	/* floating point overflow */
 | 
					
						
							|  |  |  | # define PR_FP_EXC_UND		0x040000	/* floating point underflow */
 | 
					
						
							|  |  |  | # define PR_FP_EXC_RES		0x080000	/* floating point inexact result */
 | 
					
						
							|  |  |  | # define PR_FP_EXC_INV		0x100000	/* floating point invalid operation */
 | 
					
						
							|  |  |  | # define PR_FP_EXC_DISABLED	0	/* FP exceptions disabled */
 | 
					
						
							|  |  |  | # define PR_FP_EXC_NONRECOV	1	/* async non-recoverable exc. mode */
 | 
					
						
							|  |  |  | # define PR_FP_EXC_ASYNC	2	/* async recoverable exception mode */
 | 
					
						
							|  |  |  | # define PR_FP_EXC_PRECISE	3	/* precise exception mode */
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | /* Get/set whether we use statistical process timing or accurate timestamp
 | 
					
						
							|  |  |  |  * based process timing */ | 
					
						
							|  |  |  | #define PR_GET_TIMING   13
 | 
					
						
							|  |  |  | #define PR_SET_TIMING   14
 | 
					
						
							|  |  |  | # define PR_TIMING_STATISTICAL  0       /* Normal, traditional,
 | 
					
						
							|  |  |  |                                                    statistical process timing */ | 
					
						
							|  |  |  | # define PR_TIMING_TIMESTAMP    1       /* Accurate timestamp based
 | 
					
						
							|  |  |  |                                                    process timing */ | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | #define PR_SET_NAME    15		/* Set process name */
 | 
					
						
							|  |  |  | #define PR_GET_NAME    16		/* Get process name */
 | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-06-07 16:10:19 +10:00
										 |  |  | /* Get/set process endian */ | 
					
						
							|  |  |  | #define PR_GET_ENDIAN	19
 | 
					
						
							|  |  |  | #define PR_SET_ENDIAN	20
 | 
					
						
							|  |  |  | # define PR_ENDIAN_BIG		0
 | 
					
						
							|  |  |  | # define PR_ENDIAN_LITTLE	1	/* True little endian mode */
 | 
					
						
							|  |  |  | # define PR_ENDIAN_PPC_LITTLE	2	/* "PowerPC" pseudo little endian */
 | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-07-15 23:41:32 -07:00
										 |  |  | /* Get/set process seccomp mode */ | 
					
						
							|  |  |  | #define PR_GET_SECCOMP	21
 | 
					
						
							|  |  |  | #define PR_SET_SECCOMP	22
 | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												capabilities: implement per-process securebits
Filesystem capability support makes it possible to do away with (set)uid-0
based privilege and use capabilities instead.  That is, with filesystem
support for capabilities but without this present patch, it is (conceptually)
possible to manage a system with capabilities alone and never need to obtain
privilege via (set)uid-0.
Of course, conceptually isn't quite the same as currently possible since few
user applications, certainly not enough to run a viable system, are currently
prepared to leverage capabilities to exercise privilege.  Further, many
applications exist that may never get upgraded in this way, and the kernel
will continue to want to support their setuid-0 base privilege needs.
Where pure-capability applications evolve and replace setuid-0 binaries, it is
desirable that there be a mechanisms by which they can contain their
privilege.  In addition to leveraging the per-process bounding and inheritable
sets, this should include suppressing the privilege of the uid-0 superuser
from the process' tree of children.
The feature added by this patch can be leveraged to suppress the privilege
associated with (set)uid-0.  This suppression requires CAP_SETPCAP to
initiate, and only immediately affects the 'current' process (it is inherited
through fork()/exec()).  This reimplementation differs significantly from the
historical support for securebits which was system-wide, unwieldy and which
has ultimately withered to a dead relic in the source of the modern kernel.
With this patch applied a process, that is capable(CAP_SETPCAP), can now drop
all legacy privilege (through uid=0) for itself and all subsequently
fork()'d/exec()'d children with:
  prctl(PR_SET_SECUREBITS, 0x2f);
This patch represents a no-op unless CONFIG_SECURITY_FILE_CAPABILITIES is
enabled at configure time.
[akpm@linux-foundation.org: fix uninitialised var warning]
[serue@us.ibm.com: capabilities: use cap_task_prctl when !CONFIG_SECURITY]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Paul Moore <paul.moore@hp.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-04-28 02:13:40 -07:00
										 |  |  | /* Get/set the capability bounding set (as per security/commoncap.c) */ | 
					
						
							| 
									
										
											  
											
												capabilities: introduce per-process capability bounding set
The capability bounding set is a set beyond which capabilities cannot grow.
 Currently cap_bset is per-system.  It can be manipulated through sysctl,
but only init can add capabilities.  Root can remove capabilities.  By
default it includes all caps except CAP_SETPCAP.
This patch makes the bounding set per-process when file capabilities are
enabled.  It is inherited at fork from parent.  Noone can add elements,
CAP_SETPCAP is required to remove them.
One example use of this is to start a safer container.  For instance, until
device namespaces or per-container device whitelists are introduced, it is
best to take CAP_MKNOD away from a container.
The bounding set will not affect pP and pE immediately.  It will only
affect pP' and pE' after subsequent exec()s.  It also does not affect pI,
and exec() does not constrain pI'.  So to really start a shell with no way
of regain CAP_MKNOD, you would do
	prctl(PR_CAPBSET_DROP, CAP_MKNOD);
	cap_t cap = cap_get_proc();
	cap_value_t caparray[1];
	caparray[0] = CAP_MKNOD;
	cap_set_flag(cap, CAP_INHERITABLE, 1, caparray, CAP_DROP);
	cap_set_proc(cap);
	cap_free(cap);
The following test program will get and set the bounding
set (but not pI).  For instance
	./bset get
		(lists capabilities in bset)
	./bset drop cap_net_raw
		(starts shell with new bset)
		(use capset, setuid binary, or binary with
		file capabilities to try to increase caps)
************************************************************
cap_bound.c
************************************************************
 #include <sys/prctl.h>
 #include <linux/capability.h>
 #include <sys/types.h>
 #include <unistd.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #ifndef PR_CAPBSET_READ
 #define PR_CAPBSET_READ 23
 #endif
 #ifndef PR_CAPBSET_DROP
 #define PR_CAPBSET_DROP 24
 #endif
int usage(char *me)
{
	printf("Usage: %s get\n", me);
	printf("       %s drop <capability>\n", me);
	return 1;
}
 #define numcaps 32
char *captable[numcaps] = {
	"cap_chown",
	"cap_dac_override",
	"cap_dac_read_search",
	"cap_fowner",
	"cap_fsetid",
	"cap_kill",
	"cap_setgid",
	"cap_setuid",
	"cap_setpcap",
	"cap_linux_immutable",
	"cap_net_bind_service",
	"cap_net_broadcast",
	"cap_net_admin",
	"cap_net_raw",
	"cap_ipc_lock",
	"cap_ipc_owner",
	"cap_sys_module",
	"cap_sys_rawio",
	"cap_sys_chroot",
	"cap_sys_ptrace",
	"cap_sys_pacct",
	"cap_sys_admin",
	"cap_sys_boot",
	"cap_sys_nice",
	"cap_sys_resource",
	"cap_sys_time",
	"cap_sys_tty_config",
	"cap_mknod",
	"cap_lease",
	"cap_audit_write",
	"cap_audit_control",
	"cap_setfcap"
};
int getbcap(void)
{
	int comma=0;
	unsigned long i;
	int ret;
	printf("i know of %d capabilities\n", numcaps);
	printf("capability bounding set:");
	for (i=0; i<numcaps; i++) {
		ret = prctl(PR_CAPBSET_READ, i);
		if (ret < 0)
			perror("prctl");
		else if (ret==1)
			printf("%s%s", (comma++) ? ", " : " ", captable[i]);
	}
	printf("\n");
	return 0;
}
int capdrop(char *str)
{
	unsigned long i;
	int found=0;
	for (i=0; i<numcaps; i++) {
		if (strcmp(captable[i], str) == 0) {
			found=1;
			break;
		}
	}
	if (!found)
		return 1;
	if (prctl(PR_CAPBSET_DROP, i)) {
		perror("prctl");
		return 1;
	}
	return 0;
}
int main(int argc, char *argv[])
{
	if (argc<2)
		return usage(argv[0]);
	if (strcmp(argv[1], "get")==0)
		return getbcap();
	if (strcmp(argv[1], "drop")!=0 || argc<3)
		return usage(argv[0]);
	if (capdrop(argv[2])) {
		printf("unknown capability\n");
		return 1;
	}
	return execl("/bin/bash", "/bin/bash", NULL);
}
************************************************************
[serue@us.ibm.com: fix typo]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>a
Signed-off-by: "Serge E. Hallyn" <serue@us.ibm.com>
Tested-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-02-04 22:29:45 -08:00
										 |  |  | #define PR_CAPBSET_READ 23
 | 
					
						
							|  |  |  | #define PR_CAPBSET_DROP 24
 | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-04-11 18:54:17 +02:00
										 |  |  | /* Get/set the process' ability to use the timestamp counter instruction */ | 
					
						
							|  |  |  | #define PR_GET_TSC 25
 | 
					
						
							|  |  |  | #define PR_SET_TSC 26
 | 
					
						
							|  |  |  | # define PR_TSC_ENABLE		1	/* allow the use of the timestamp counter */
 | 
					
						
							|  |  |  | # define PR_TSC_SIGSEGV		2	/* throw a SIGSEGV instead of reading the TSC */
 | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												capabilities: implement per-process securebits
Filesystem capability support makes it possible to do away with (set)uid-0
based privilege and use capabilities instead.  That is, with filesystem
support for capabilities but without this present patch, it is (conceptually)
possible to manage a system with capabilities alone and never need to obtain
privilege via (set)uid-0.
Of course, conceptually isn't quite the same as currently possible since few
user applications, certainly not enough to run a viable system, are currently
prepared to leverage capabilities to exercise privilege.  Further, many
applications exist that may never get upgraded in this way, and the kernel
will continue to want to support their setuid-0 base privilege needs.
Where pure-capability applications evolve and replace setuid-0 binaries, it is
desirable that there be a mechanisms by which they can contain their
privilege.  In addition to leveraging the per-process bounding and inheritable
sets, this should include suppressing the privilege of the uid-0 superuser
from the process' tree of children.
The feature added by this patch can be leveraged to suppress the privilege
associated with (set)uid-0.  This suppression requires CAP_SETPCAP to
initiate, and only immediately affects the 'current' process (it is inherited
through fork()/exec()).  This reimplementation differs significantly from the
historical support for securebits which was system-wide, unwieldy and which
has ultimately withered to a dead relic in the source of the modern kernel.
With this patch applied a process, that is capable(CAP_SETPCAP), can now drop
all legacy privilege (through uid=0) for itself and all subsequently
fork()'d/exec()'d children with:
  prctl(PR_SET_SECUREBITS, 0x2f);
This patch represents a no-op unless CONFIG_SECURITY_FILE_CAPABILITIES is
enabled at configure time.
[akpm@linux-foundation.org: fix uninitialised var warning]
[serue@us.ibm.com: capabilities: use cap_task_prctl when !CONFIG_SECURITY]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Paul Moore <paul.moore@hp.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-04-28 02:13:40 -07:00
										 |  |  | /* Get/set securebits (as per security/commoncap.c) */ | 
					
						
							|  |  |  | #define PR_GET_SECUREBITS 27
 | 
					
						
							|  |  |  | #define PR_SET_SECUREBITS 28
 | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-09-01 15:52:40 -07:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * Get/set the timerslack as used by poll/select/nanosleep | 
					
						
							|  |  |  |  * A value of 0 means "use default" | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | #define PR_SET_TIMERSLACK 29
 | 
					
						
							|  |  |  | #define PR_GET_TIMERSLACK 30
 | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												perf: Do the big rename: Performance Counters -> Performance Events
Bye-bye Performance Counters, welcome Performance Events!
In the past few months the perfcounters subsystem has grown out its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.
Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.
All in one, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)
The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.
Thanks goes to Stephane Eranian who first observed this misnomer and
suggested a rename.
User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)
This patch has been generated via the following script:
  FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
  sed -i \
    -e 's/PERF_EVENT_/PERF_RECORD_/g' \
    -e 's/PERF_COUNTER/PERF_EVENT/g' \
    -e 's/perf_counter/perf_event/g' \
    -e 's/nb_counters/nb_events/g' \
    -e 's/swcounter/swevent/g' \
    -e 's/tpcounter_event/tp_event/g' \
    $FILES
  for N in $(find . -name perf_counter.[ch]); do
    M=$(echo $N | sed 's/perf_counter/perf_event/g')
    mv $N $M
  done
  FILES=$(find . -name perf_event.*)
  sed -i \
    -e 's/COUNTER_MASK/REG_MASK/g' \
    -e 's/COUNTER/EVENT/g' \
    -e 's/\<event\>/event_id/g' \
    -e 's/counter/event/g' \
    -e 's/Counter/Event/g' \
    $FILES
... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.
Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.
( NOTE: 'counters' are still the proper terminology when we deal
  with hardware registers - and these sed scripts are a bit
  over-eager in renaming them. I've undone some of that, but
  in case there's something left where 'counter' would be
  better than 'event' we can undo that on an individual basis
  instead of touching an otherwise nicely automated patch. )
Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
											
										 
											2009-09-21 12:02:48 +02:00
										 |  |  | #define PR_TASK_PERF_EVENTS_DISABLE		31
 | 
					
						
							|  |  |  | #define PR_TASK_PERF_EVENTS_ENABLE		32
 | 
					
						
							| 
									
										
										
										
											2008-12-11 14:59:31 +01:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-10-04 02:20:11 +02:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * Set early/late kill mode for hwpoison memory corruption. | 
					
						
							|  |  |  |  * This influences when the process gets killed on a memory corruption. | 
					
						
							|  |  |  |  */ | 
					
						
							| 
									
										
										
										
											2009-09-16 11:50:14 +02:00
										 |  |  | #define PR_MCE_KILL	33
 | 
					
						
							| 
									
										
										
										
											2009-10-04 02:20:11 +02:00
										 |  |  | # define PR_MCE_KILL_CLEAR   0
 | 
					
						
							|  |  |  | # define PR_MCE_KILL_SET     1
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | # define PR_MCE_KILL_LATE    0
 | 
					
						
							|  |  |  | # define PR_MCE_KILL_EARLY   1
 | 
					
						
							|  |  |  | # define PR_MCE_KILL_DEFAULT 2
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | #define PR_MCE_KILL_GET 34
 | 
					
						
							| 
									
										
										
										
											2009-09-16 11:50:14 +02:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-01-12 17:20:55 -08:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * Tune up process memory map specifics. | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | #define PR_SET_MM		35
 | 
					
						
							|  |  |  | # define PR_SET_MM_START_CODE		1
 | 
					
						
							|  |  |  | # define PR_SET_MM_END_CODE		2
 | 
					
						
							|  |  |  | # define PR_SET_MM_START_DATA		3
 | 
					
						
							|  |  |  | # define PR_SET_MM_END_DATA		4
 | 
					
						
							|  |  |  | # define PR_SET_MM_START_STACK		5
 | 
					
						
							|  |  |  | # define PR_SET_MM_START_BRK		6
 | 
					
						
							|  |  |  | # define PR_SET_MM_BRK			7
 | 
					
						
							| 
									
										
										
										
											2012-05-31 16:26:45 -07:00
										 |  |  | # define PR_SET_MM_ARG_START		8
 | 
					
						
							|  |  |  | # define PR_SET_MM_ARG_END		9
 | 
					
						
							|  |  |  | # define PR_SET_MM_ENV_START		10
 | 
					
						
							|  |  |  | # define PR_SET_MM_ENV_END		11
 | 
					
						
							|  |  |  | # define PR_SET_MM_AUXV			12
 | 
					
						
							| 
									
										
										
										
											2012-05-31 16:26:46 -07:00
										 |  |  | # define PR_SET_MM_EXE_FILE		13
 | 
					
						
							| 
									
										
											  
											
												prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation
During development of c/r we've noticed that in case if we need to support
user namespaces we face a problem with capabilities in prctl(PR_SET_MM,
...) call, in particular once new user namespace is created
capable(CAP_SYS_RESOURCE) no longer passes.
A approach is to eliminate CAP_SYS_RESOURCE check but pass all new values
in one bundle, which would allow the kernel to make more intensive test
for sanity of values and same time allow us to support checkpoint/restore
of user namespaces.
Thus a new command PR_SET_MM_MAP introduced. It takes a pointer of
prctl_mm_map structure which carries all the members to be updated.
	prctl(PR_SET_MM, PR_SET_MM_MAP, struct prctl_mm_map *, size)
	struct prctl_mm_map {
		__u64	start_code;
		__u64	end_code;
		__u64	start_data;
		__u64	end_data;
		__u64	start_brk;
		__u64	brk;
		__u64	start_stack;
		__u64	arg_start;
		__u64	arg_end;
		__u64	env_start;
		__u64	env_end;
		__u64	*auxv;
		__u32	auxv_size;
		__u32	exe_fd;
	};
All members except @exe_fd correspond ones of struct mm_struct.  To figure
out which available values these members may take here are meanings of the
members.
 - start_code, end_code: represent bounds of executable code area
 - start_data, end_data: represent bounds of data area
 - start_brk, brk: used to calculate bounds for brk() syscall
 - start_stack: used when accounting space needed for command
   line arguments, environment and shmat() syscall
 - arg_start, arg_end, env_start, env_end: represent memory area
   supplied for command line arguments and environment variables
 - auxv, auxv_size: carries auxiliary vector, Elf format specifics
 - exe_fd: file descriptor number for executable link (/proc/self/exe)
Thus we apply the following requirements to the values
1) Any member except @auxv, @auxv_size, @exe_fd is rather an address
   in user space thus it must be laying inside [mmap_min_addr, mmap_max_addr)
   interval.
2) While @[start|end]_code and @[start|end]_data may point to an nonexisting
   VMAs (say a program maps own new .text and .data segments during execution)
   the rest of members should belong to VMA which must exist.
3) Addresses must be ordered, ie @start_ member must not be greater or
   equal to appropriate @end_ member.
4) As in regular Elf loading procedure we require that @start_brk and
   @brk be greater than @end_data.
5) If RLIMIT_DATA rlimit is set to non-infinity new values should not
   exceed existing limit. Same applies to RLIMIT_STACK.
6) Auxiliary vector size must not exceed existing one (which is
   predefined as AT_VECTOR_SIZE and depends on architecture).
7) File descriptor passed in @exe_file should be pointing
   to executable file (because we use existing prctl_set_mm_exe_file_locked
   helper it ensures that the file we are going to use as exe link has all
   required permission granted).
Now about where these members are involved inside kernel code:
 - @start_code and @end_code are used in /proc/$pid/[stat|statm] output;
 - @start_data and @end_data are used in /proc/$pid/[stat|statm] output,
   also they are considered if there enough space for brk() syscall
   result if RLIMIT_DATA is set;
 - @start_brk shown in /proc/$pid/stat output and accounted in brk()
   syscall if RLIMIT_DATA is set; also this member is tested to
   find a symbolic name of mmap event for perf system (we choose
   if event is generated for "heap" area); one more aplication is
   selinux -- we test if a process has PROCESS__EXECHEAP permission
   if trying to make heap area being executable with mprotect() syscall;
 - @brk is a current value for brk() syscall which lays inside heap
   area, it's shown in /proc/$pid/stat. When syscall brk() succesfully
   provides new memory area to a user space upon brk() completion the
   mm::brk is updated to carry new value;
   Both @start_brk and @brk are actively used in /proc/$pid/maps
   and /proc/$pid/smaps output to find a symbolic name "heap" for
   VMA being scanned;
 - @start_stack is printed out in /proc/$pid/stat and used to
   find a symbolic name "stack" for task and threads in
   /proc/$pid/maps and /proc/$pid/smaps output, and as the same
   as with @start_brk -- perf system uses it for event naming.
   Also kernel treat this member as a start address of where
   to map vDSO pages and to check if there is enough space
   for shmat() syscall;
 - @arg_start, @arg_end, @env_start and @env_end are printed out
   in /proc/$pid/stat. Another access to the data these members
   represent is to read /proc/$pid/environ or /proc/$pid/cmdline.
   Any attempt to read these areas kernel tests with access_process_vm
   helper so a user must have enough rights for this action;
 - @auxv and @auxv_size may be read from /proc/$pid/auxv. Strictly
   speaking kernel doesn't care much about which exactly data is
   sitting there because it is solely for userspace;
 - @exe_fd is referred from /proc/$pid/exe and when generating
   coredump. We uses prctl_set_mm_exe_file_locked helper to update
   this member, so exe-file link modification remains one-shot
   action.
Still note that updating exe-file link now doesn't require sys-resource
capability anymore, after all there is no much profit in preventing setup
own file link (there are a number of ways to execute own code -- ptrace,
ld-preload, so that the only reliable way to find which exactly code is
executed is to inspect running program memory).  Still we require the
caller to be at least user-namespace root user.
I believe the old interface should be deprecated and ripped off in a
couple of kernel releases if no one against.
To test if new interface is implemented in the kernel one can pass
PR_SET_MM_MAP_SIZE opcode and the kernel returns the size of currently
supported struct prctl_mm_map.
[akpm@linux-foundation.org: fix 80-col wordwrap in macro definitions]
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Andrew Vagin <avagin@openvz.org>
Tested-by: Andrew Vagin <avagin@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Julien Tinnes <jln@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2014-10-09 15:27:37 -07:00
										 |  |  | # define PR_SET_MM_MAP			14
 | 
					
						
							|  |  |  | # define PR_SET_MM_MAP_SIZE		15
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | /*
 | 
					
						
							|  |  |  |  * This structure provides new memory descriptor | 
					
						
							|  |  |  |  * map which mostly modifies /proc/pid/stat[m] | 
					
						
							|  |  |  |  * output for a task. This mostly done in a | 
					
						
							|  |  |  |  * sake of checkpoint/restore functionality. | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | struct prctl_mm_map { | 
					
						
							|  |  |  | 	__u64	start_code;		/* code section bounds */ | 
					
						
							|  |  |  | 	__u64	end_code; | 
					
						
							|  |  |  | 	__u64	start_data;		/* data section bounds */ | 
					
						
							|  |  |  | 	__u64	end_data; | 
					
						
							|  |  |  | 	__u64	start_brk;		/* heap for brk() syscall */ | 
					
						
							|  |  |  | 	__u64	brk; | 
					
						
							|  |  |  | 	__u64	start_stack;		/* stack starts at */ | 
					
						
							|  |  |  | 	__u64	arg_start;		/* command line arguments bounds */ | 
					
						
							|  |  |  | 	__u64	arg_end; | 
					
						
							|  |  |  | 	__u64	env_start;		/* environment variables bounds */ | 
					
						
							|  |  |  | 	__u64	env_end; | 
					
						
							|  |  |  | 	__u64	*auxv;			/* auxiliary vector */ | 
					
						
							|  |  |  | 	__u32	auxv_size;		/* vector size */ | 
					
						
							|  |  |  | 	__u32	exe_fd;			/* /proc/$pid/exe link file */ | 
					
						
							|  |  |  | }; | 
					
						
							| 
									
										
										
										
											2012-01-12 17:20:55 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-12-21 12:17:04 -08:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * Set specific pid that is allowed to ptrace the current task. | 
					
						
							|  |  |  |  * A value of 0 mean "no process". | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | #define PR_SET_PTRACER 0x59616d61
 | 
					
						
							| 
									
										
										
										
											2012-02-14 16:48:09 -08:00
										 |  |  | # define PR_SET_PTRACER_ANY ((unsigned long)-1)
 | 
					
						
							| 
									
										
										
										
											2011-12-21 12:17:04 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-06-07 14:21:12 -07:00
										 |  |  | #define PR_SET_CHILD_SUBREAPER	36
 | 
					
						
							|  |  |  | #define PR_GET_CHILD_SUBREAPER	37
 | 
					
						
							| 
									
										
											  
											
												prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision
Userspace service managers/supervisors need to track their started
services.  Many services daemonize by double-forking and get implicitly
re-parented to PID 1.  The service manager will no longer be able to
receive the SIGCHLD signals for them, and is no longer in charge of
reaping the children with wait().  All information about the children is
lost at the moment PID 1 cleans up the re-parented processes.
With this prctl, a service manager process can mark itself as a sort of
'sub-init', able to stay as the parent for all orphaned processes
created by the started services.  All SIGCHLD signals will be delivered
to the service manager.
Receiving SIGCHLD and doing wait() is in cases of a service-manager much
preferred over any possible asynchronous notification about specific
PIDs, because the service manager has full access to the child process
data in /proc and the PID can not be re-used until the wait(), the
service-manager itself is in charge of, has happened.
As a side effect, the relevant parent PID information does not get lost
by a double-fork, which results in a more elaborate process tree and
'ps' output:
before:
  # ps afx
  253 ?        Ss     0:00 /bin/dbus-daemon --system --nofork
  294 ?        Sl     0:00 /usr/libexec/polkit-1/polkitd
  328 ?        S      0:00 /usr/sbin/modem-manager
  608 ?        Sl     0:00 /usr/libexec/colord
  658 ?        Sl     0:00 /usr/libexec/upowerd
  819 ?        Sl     0:00 /usr/libexec/imsettings-daemon
  916 ?        Sl     0:00 /usr/libexec/udisks-daemon
  917 ?        S      0:00  \_ udisks-daemon: not polling any devices
after:
  # ps afx
  294 ?        Ss     0:00 /bin/dbus-daemon --system --nofork
  426 ?        Sl     0:00  \_ /usr/libexec/polkit-1/polkitd
  449 ?        S      0:00  \_ /usr/sbin/modem-manager
  635 ?        Sl     0:00  \_ /usr/libexec/colord
  705 ?        Sl     0:00  \_ /usr/libexec/upowerd
  959 ?        Sl     0:00  \_ /usr/libexec/udisks-daemon
  960 ?        S      0:00  |   \_ udisks-daemon: not polling any devices
  977 ?        Sl     0:00  \_ /usr/libexec/packagekitd
This prctl is orthogonal to PID namespaces.  PID namespaces are isolated
from each other, while a service management process usually requires the
services to live in the same namespace, to be able to talk to each
other.
Users of this will be the systemd per-user instance, which provides
init-like functionality for the user's login session and D-Bus, which
activates bus services on-demand.  Both need init-like capabilities to
be able to properly keep track of the services they start.
Many thanks to Oleg for several rounds of review and insights.
[akpm@linux-foundation.org: fix comment layout and spelling]
[akpm@linux-foundation.org: add lengthy code comment from Oleg]
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Lennart Poettering <lennart@poettering.net>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Acked-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2012-03-23 15:01:54 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
											  
											
												Add PR_{GET,SET}_NO_NEW_PRIVS to prevent execve from granting privs
With this change, calling
  prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
disables privilege granting operations at execve-time.  For example, a
process will not be able to execute a setuid binary to change their uid
or gid if this bit is set.  The same is true for file capabilities.
Additionally, LSM_UNSAFE_NO_NEW_PRIVS is defined to ensure that
LSMs respect the requested behavior.
To determine if the NO_NEW_PRIVS bit is set, a task may call
  prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
It returns 1 if set and 0 if it is not set. If any of the arguments are
non-zero, it will return -1 and set errno to -EINVAL.
(PR_SET_NO_NEW_PRIVS behaves similarly.)
This functionality is desired for the proposed seccomp filter patch
series.  By using PR_SET_NO_NEW_PRIVS, it allows a task to modify the
system call behavior for itself and its child tasks without being
able to impact the behavior of a more privileged task.
Another potential use is making certain privileged operations
unprivileged.  For example, chroot may be considered "safe" if it cannot
affect privileged tasks.
Note, this patch causes execve to fail when PR_SET_NO_NEW_PRIVS is
set and AppArmor is in use.  It is fixed in a subsequent patch.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Will Drewry <wad@chromium.org>
Acked-by: Eric Paris <eparis@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
v18: updated change desc
v17: using new define values as per 3.4
Signed-off-by: James Morris <james.l.morris@oracle.com>
											
										 
											2012-04-12 16:47:50 -05:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * If no_new_privs is set, then operations that grant new privileges (i.e. | 
					
						
							|  |  |  |  * execve) will either fail or not grant them.  This affects suid/sgid, | 
					
						
							|  |  |  |  * file capabilities, and LSMs. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * Operations that merely manipulate or drop existing privileges (setresuid, | 
					
						
							|  |  |  |  * capset, etc.) will still work.  Drop those privileges if you want them gone. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * Changing LSM security domain is considered a new privilege.  So, for example, | 
					
						
							|  |  |  |  * asking selinux for a specific new context (e.g. with runcon) will result | 
					
						
							|  |  |  |  * in execve returning -EPERM. | 
					
						
							| 
									
										
										
										
											2012-07-05 11:23:24 -07:00
										 |  |  |  * | 
					
						
							|  |  |  |  * See Documentation/prctl/no_new_privs.txt for more details. | 
					
						
							| 
									
										
											  
											
												Add PR_{GET,SET}_NO_NEW_PRIVS to prevent execve from granting privs
With this change, calling
  prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
disables privilege granting operations at execve-time.  For example, a
process will not be able to execute a setuid binary to change their uid
or gid if this bit is set.  The same is true for file capabilities.
Additionally, LSM_UNSAFE_NO_NEW_PRIVS is defined to ensure that
LSMs respect the requested behavior.
To determine if the NO_NEW_PRIVS bit is set, a task may call
  prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
It returns 1 if set and 0 if it is not set. If any of the arguments are
non-zero, it will return -1 and set errno to -EINVAL.
(PR_SET_NO_NEW_PRIVS behaves similarly.)
This functionality is desired for the proposed seccomp filter patch
series.  By using PR_SET_NO_NEW_PRIVS, it allows a task to modify the
system call behavior for itself and its child tasks without being
able to impact the behavior of a more privileged task.
Another potential use is making certain privileged operations
unprivileged.  For example, chroot may be considered "safe" if it cannot
affect privileged tasks.
Note, this patch causes execve to fail when PR_SET_NO_NEW_PRIVS is
set and AppArmor is in use.  It is fixed in a subsequent patch.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Will Drewry <wad@chromium.org>
Acked-by: Eric Paris <eparis@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
v18: updated change desc
v17: using new define values as per 3.4
Signed-off-by: James Morris <james.l.morris@oracle.com>
											
										 
											2012-04-12 16:47:50 -05:00
										 |  |  |  */ | 
					
						
							| 
									
										
										
										
											2012-06-07 14:21:12 -07:00
										 |  |  | #define PR_SET_NO_NEW_PRIVS	38
 | 
					
						
							|  |  |  | #define PR_GET_NO_NEW_PRIVS	39
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | #define PR_GET_TID_ADDRESS	40
 | 
					
						
							| 
									
										
											  
											
												Add PR_{GET,SET}_NO_NEW_PRIVS to prevent execve from granting privs
With this change, calling
  prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
disables privilege granting operations at execve-time.  For example, a
process will not be able to execute a setuid binary to change their uid
or gid if this bit is set.  The same is true for file capabilities.
Additionally, LSM_UNSAFE_NO_NEW_PRIVS is defined to ensure that
LSMs respect the requested behavior.
To determine if the NO_NEW_PRIVS bit is set, a task may call
  prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
It returns 1 if set and 0 if it is not set. If any of the arguments are
non-zero, it will return -1 and set errno to -EINVAL.
(PR_SET_NO_NEW_PRIVS behaves similarly.)
This functionality is desired for the proposed seccomp filter patch
series.  By using PR_SET_NO_NEW_PRIVS, it allows a task to modify the
system call behavior for itself and its child tasks without being
able to impact the behavior of a more privileged task.
Another potential use is making certain privileged operations
unprivileged.  For example, chroot may be considered "safe" if it cannot
affect privileged tasks.
Note, this patch causes execve to fail when PR_SET_NO_NEW_PRIVS is
set and AppArmor is in use.  It is fixed in a subsequent patch.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Will Drewry <wad@chromium.org>
Acked-by: Eric Paris <eparis@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
v18: updated change desc
v17: using new define values as per 3.4
Signed-off-by: James Morris <james.l.morris@oracle.com>
											
										 
											2012-04-12 16:47:50 -05:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2014-04-07 15:37:10 -07:00
										 |  |  | #define PR_SET_THP_DISABLE	41
 | 
					
						
							|  |  |  | #define PR_GET_THP_DISABLE	42
 | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												x86, mpx: On-demand kernel allocation of bounds tables
This is really the meat of the MPX patch set.  If there is one patch to
review in the entire series, this is the one.  There is a new ABI here
and this kernel code also interacts with userspace memory in a
relatively unusual manner.  (small FAQ below).
Long Description:
This patch adds two prctl() commands to provide enable or disable the
management of bounds tables in kernel, including on-demand kernel
allocation (See the patch "on-demand kernel allocation of bounds tables")
and cleanup (See the patch "cleanup unused bound tables"). Applications
do not strictly need the kernel to manage bounds tables and we expect
some applications to use MPX without taking advantage of this kernel
support. This means the kernel can not simply infer whether an application
needs bounds table management from the MPX registers.  The prctl() is an
explicit signal from userspace.
PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to
require kernel's help in managing bounds tables.
PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace don't
want kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel
won't allocate and free bounds tables even if the CPU supports MPX.
PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds
directory out of a userspace register (bndcfgu) and then cache it into
a new field (->bd_addr) in  the 'mm_struct'.  PR_MPX_DISABLE_MANAGEMENT
will set "bd_addr" to an invalid address.  Using this scheme, we can
use "bd_addr" to determine whether the management of bounds tables in
kernel is enabled.
Also, the only way to access that bndcfgu register is via an xsaves,
which can be expensive.  Caching "bd_addr" like this also helps reduce
the cost of those xsaves when doing table cleanup at munmap() time.
Unfortunately, we can not apply this optimization to #BR fault time
because we need an xsave to get the value of BNDSTATUS.
==== Why does the hardware even have these Bounds Tables? ====
MPX only has 4 hardware registers for storing bounds information.
If MPX-enabled code needs more than these 4 registers, it needs to
spill them somewhere. It has two special instructions for this
which allow the bounds to be moved between the bounds registers
and some new "bounds tables".
They are similar conceptually to a page fault and will be raised by
the MPX hardware during both bounds violations or when the tables
are not present. This patch handles those #BR exceptions for
not-present tables by carving the space out of the normal processes
address space (essentially calling the new mmap() interface indroduced
earlier in this patch set.) and then pointing the bounds-directory
over to it.
The tables *need* to be accessed and controlled by userspace because
the instructions for moving bounds in and out of them are extremely
frequent. They potentially happen every time a register pointing to
memory is dereferenced. Any direct kernel involvement (like a syscall)
to access the tables would obviously destroy performance.
==== Why not do this in userspace? ====
This patch is obviously doing this allocation in the kernel.
However, MPX does not strictly *require* anything in the kernel.
It can theoretically be done completely from userspace. Here are
a few ways this *could* be done. I don't think any of them are
practical in the real-world, but here they are.
Q: Can virtual space simply be reserved for the bounds tables so
   that we never have to allocate them?
A: As noted earlier, these tables are *HUGE*. An X-GB virtual
   area needs 4*X GB of virtual space, plus 2GB for the bounds
   directory. If we were to preallocate them for the 128TB of
   user virtual address space, we would need to reserve 512TB+2GB,
   which is larger than the entire virtual address space today.
   This means they can not be reserved ahead of time. Also, a
   single process's pre-popualated bounds directory consumes 2GB
   of virtual *AND* physical memory. IOW, it's completely
   infeasible to prepopulate bounds directories.
Q: Can we preallocate bounds table space at the same time memory
   is allocated which might contain pointers that might eventually
   need bounds tables?
A: This would work if we could hook the site of each and every
   memory allocation syscall. This can be done for small,
   constrained applications. But, it isn't practical at a larger
   scale since a given app has no way of controlling how all the
   parts of the app might allocate memory (think libraries). The
   kernel is really the only place to intercept these calls.
Q: Could a bounds fault be handed to userspace and the tables
   allocated there in a signal handler instead of in the kernel?
A: (thanks to tglx) mmap() is not on the list of safe async
   handler functions and even if mmap() would work it still
   requires locking or nasty tricks to keep track of the
   allocation state there.
Having ruled out all of the userspace-only approaches for managing
bounds tables that we could think of, we create them on demand in
the kernel.
Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
											
										 
											2014-11-14 07:18:29 -08:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * Tell the kernel to start/stop helping userspace manage bounds tables. | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | #define PR_MPX_ENABLE_MANAGEMENT  43
 | 
					
						
							|  |  |  | #define PR_MPX_DISABLE_MANAGEMENT 44
 | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2015-01-08 12:17:37 +00:00
										 |  |  | #define PR_SET_FP_MODE		45
 | 
					
						
							|  |  |  | #define PR_GET_FP_MODE		46
 | 
					
						
							|  |  |  | # define PR_FP_MODE_FR		(1 << 0)	/* 64b FP registers */
 | 
					
						
							|  |  |  | # define PR_FP_MODE_FRE		(1 << 1)	/* 32b compatibility */
 | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | #endif /* _LINUX_PRCTL_H */
 |