When CONFIG_CFI_PERMISSIVE is not set, ensure the third argument
passed to __cfi_check from __cfi_slowpath is NULL to avoid an invalid
memory access in __cfi_check_fail. __cfi_check_fail always traps
anyway, but the error message will be less confusing with this patch.
Note that kernels built with full LTO aren't affected as they always
clear the argument before a __cfi_slowpath call. Later kernel versions
are also not affected as they use -fno-sanitize-trap=cfi.
Bug: 196763360
Change-Id: Ifa5b4e324737a3069f7a772dd9b392042ec8407e
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
If rcu_read_lock_sched tracing is enabled, the tracing subsystem can
perform a jump which needs to be checked by CFI. For example, stm_ftrace
source is enabled as a module and hooks into enabled ftrace events. This
can cause a recursive loop where find_shadow_check_fn ->
rcu_read_lock_sched -> (call to stm_ftrace generates cfi slowpath) ->
find_shadow_check_fn -> rcu_read_lock_sched -> ...
To avoid the recursion, either the ftrace code needs to be marked with
__no_cfi or CFI should not trace. Use the "_notrace" variants in CFI to
avoid tracing so that CFI can guard ftrace.
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Cc: stable@vger.kernel.org
Fixes: cf68fffb66 ("add support for Clang CFI")
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210811155914.19550-1-quic_eberman@quicinc.com
Bug: 194223154
Change-Id: I7d112496c7f503f95ba69390f6454623cf6dfed2
(cherry picked from commit 14c4c8e415)
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
If rcu_print_task_stall() is invoked on an rcu_node structure that does
not contain any tasks blocking the current grace period, it takes an
early exit that fails to release that rcu_node structure's lock. This
results in a self-deadlock, which is detected by lockdep.
To reproduce this bug:
tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 3 --trust-make --configs "TREE03" --kconfig "CONFIG_PROVE_LOCKING=y" --bootargs "rcutorture.stall_cpu=30 rcutorture.stall_cpu_block=1 rcutorture.fwd_progress=0 rcutorture.test_boost=0"
This will also result in other complaints, including RCU's scheduler
hook complaining about blocking rather than preemption and an rcutorture
writer stall.
Only a partial RCU CPU stall warning message will be printed because of
the self-deadlock.
This commit therefore releases the lock on the rcu_print_task_stall()
function's early exit path.
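As a rough userspace illustration of the bug class (not the kernel code),
the sketch below models a function whose early exit must drop the lock
just like its normal exit does. The node_model structure, its helpers and
the fact that the function takes the lock itself are all assumptions made
for the example:

```c
#include <stdbool.h>

/* Hypothetical stand-in for an rcu_node: a lock flag plus a count of
 * tasks blocking the current grace period. */
struct node_model {
    bool locked;
    int blocked_tasks;
};

static void node_lock(struct node_model *n)   { n->locked = true; }
static void node_unlock(struct node_model *n) { n->locked = false; }

/*
 * Returns the number of tasks that would be reported. Every return
 * path, including the early exit taken when nothing is blocked,
 * releases the lock before returning.
 */
static int print_task_stall_model(struct node_model *n)
{
    int reported;

    node_lock(n);
    if (n->blocked_tasks == 0) {
        node_unlock(n); /* the release that the early exit was missing */
        return 0;
    }
    reported = n->blocked_tasks;
    node_unlock(n);
    return reported;
}
```

Without the unlock on the early exit, the next attempt to take the same
lock from the same context would never succeed, which is the
self-deadlock described above.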
Fixes: c583bcb8f5 ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
Tested-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
BUG: 196874644
(cherry picked from commit dc87740c8a
https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git dev)
Signed-off-by: Cheng Jui Wang <cheng-jui.wang@mediatek.com>
Change-Id: I0942973e3fbac2d666d8eb9ed59b1701af13248a
For power and performance monitoring, we need to know tasks' runtime for
load estimation.
Currently, other modules cannot obtain a task's runtime because
task_sched_runtime is not exported.
Export task_sched_runtime so that other modules can call it.
Bug: 195914330
Signed-off-by: Poting Chen <poting.chen@mediatek.com>
Signed-off-by: Cheng Jui Wang <cheng-jui.wang@mediatek.com>
Change-Id: Ida5caf8ed0a32954fc0b0ed950f163c7ca493fef
There is currently nothing preventing tasks from changing their per-task
clamp values in any way that they like. The rationale is probably that
system administrators are still able to limit those clamps thanks to the
cgroup interface. However, this causes pain in a system where both
per-task and per-cgroup clamp values are expected to be under the
control of core system components (as is the case for Android).
To fix this, let's require CAP_SYS_NICE to change per-task clamp values.
There are ongoing discussions upstream about more flexible approaches
than this using the RLIMIT API -- see [1]. But the upstream discussion
has not converged yet, and this is way too late for UAPI changes in
android12-5.10 anyway, so let's apply this change which provides the
behaviour we want without actually impacting UAPIs.
[1] https://lore.kernel.org/lkml/20210623123441.592348-4-qperret@google.com/
Bug: 187186685
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I749312a77306460318ac5374cf243d00b78120dd
SCHED_FLAG_KEEP_PARAMS can be passed to sched_setattr to specify that
the call must not touch scheduling parameters (nice or priority). This
is particularly handy for uclamp when used in conjunction with
SCHED_FLAG_KEEP_POLICY as that allows issuing a syscall that only
impacts uclamp values.
However, sched_setattr always checks whether the priorities and nice
values passed in sched_attr are valid first, even if those never get
used down the line. This is useless at best since userspace can
trivially bypass this check to set the uclamp values by specifying low
priorities. However, it is cumbersome to do so as there is no single
expression of this that skips both RT and CFS checks at once. As such,
userspace needs to query the task policy first with e.g. sched_getattr
and then set sched_attr.sched_priority accordingly. This is racy and
slower than a single call.
As the priority and nice checks are useless when SCHED_FLAG_KEEP_PARAMS
is specified, simply inherit them in this case to match the policy
inheritance of SCHED_FLAG_KEEP_POLICY.
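For illustration, a userspace caller could build a uclamp-only request
roughly as follows. This is a sketch only: the demo_ names are
hypothetical local mirrors of the UAPI struct sched_attr and its flag
values, and the actual sched_setattr syscall invocation is left out:

```c
#include <stdint.h>
#include <string.h>

/* Local copies of the UAPI flag values (names prefixed for the demo). */
#define DEMO_SCHED_FLAG_KEEP_POLICY    0x08
#define DEMO_SCHED_FLAG_KEEP_PARAMS    0x10
#define DEMO_SCHED_FLAG_UTIL_CLAMP_MIN 0x20
#define DEMO_SCHED_FLAG_UTIL_CLAMP_MAX 0x40

/* Hypothetical mirror of the UAPI struct sched_attr layout. */
struct demo_sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;
    uint64_t sched_deadline;
    uint64_t sched_period;
    uint32_t sched_util_min;
    uint32_t sched_util_max;
};

/*
 * Build a request that only changes the uclamp values. Policy, nice and
 * priority are left at zero: with KEEP_POLICY and KEEP_PARAMS set they
 * are meant to be inherited from the task rather than validated.
 */
static struct demo_sched_attr make_uclamp_only_attr(uint32_t util_min,
                                                    uint32_t util_max)
{
    struct demo_sched_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.sched_flags = DEMO_SCHED_FLAG_KEEP_POLICY |
                       DEMO_SCHED_FLAG_KEEP_PARAMS |
                       DEMO_SCHED_FLAG_UTIL_CLAMP_MIN |
                       DEMO_SCHED_FLAG_UTIL_CLAMP_MAX;
    attr.sched_util_min = util_min;
    attr.sched_util_max = util_max;
    return attr;
}
```

The resulting structure would then be handed to sched_setattr in a single
call, with no prior sched_getattr needed to pick a valid priority.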
Reported-by: Wei Wang <wvw@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Link: https://lore.kernel.org/r/20210805102154.590709-3-qperret@google.com
Bug: 190237315
(cherry picked from commit f4dddf90d5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core)
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: Ifdbc9262b82c7f5c0d34952ece07770a53e3f6a5
SCHED_FLAG_SUGOV is supposed to be a kernel-only flag that userspace
cannot interact with. However, sched_getattr() currently reports it
in sched_flags if called on a sugov worker even though it is not
actually defined in a UAPI header. To avoid this, make sure to
clean-up the sched_flags field in sched_getattr() before returning to
userspace.
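A minimal sketch of the idea, assuming the clean-up amounts to masking
the flags against the set visible in the UAPI header. The demo_ constants
are local copies made for the example and flags_for_userspace() is a
hypothetical helper, not a kernel function:

```c
#include <stdint.h>

/* UAPI-visible flags: reset-on-fork, reclaim, dl-overrun, keep-policy,
 * keep-params and the two uclamp flags. */
#define DEMO_SCHED_FLAG_ALL   UINT64_C(0x7f)
/* Kernel-only flag used by schedutil worker threads. */
#define DEMO_SCHED_FLAG_SUGOV UINT64_C(0x10000000)

/* Drop every bit that userspace has no definition for. */
static uint64_t flags_for_userspace(uint64_t kernel_flags)
{
    return kernel_flags & DEMO_SCHED_FLAG_ALL;
}
```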
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210727101103.2729607-3-qperret@google.com
Bug: 190237315
(cherry picked from commit 7ad721bf10
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core)
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: Ib998d497fc38a7f8e6ccb80119336c9ac30719b7
It is possible for sched_getattr() to incorrectly report the state of
the reset_on_fork flag when called on a deadline task.
Indeed, if the flag was set on a deadline task using sched_setattr()
with flags (SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_KEEP_PARAMS), then
p->sched_reset_on_fork will be set, but __setscheduler() will bail out
early, which means that the dl_se->flags will not get updated by
__setscheduler_params()->__setparam_dl(). Consequently, if
sched_getattr() is then called on the task, __getparam_dl() will
override kattr.sched_flags with the now out-of-date copy in dl_se->flags
and report the stale value to userspace.
To fix this, make sure to only copy the flags that are relevant to
sched_deadline to and from the dl_se->flags field.
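The intent can be sketched with two small helpers. The names and the
exact set of deadline-relevant flags are assumptions for the example
(the kernel may include further kernel-only bits); only the masking
pattern matters:

```c
#include <stdint.h>

#define DEMO_FLAG_RESET_ON_FORK UINT64_C(0x01)
#define DEMO_FLAG_RECLAIM       UINT64_C(0x02)
#define DEMO_FLAG_DL_OVERRUN    UINT64_C(0x04)

/* Assumed set of flags that belong to the deadline entity. */
#define DEMO_DL_FLAGS (DEMO_FLAG_RECLAIM | DEMO_FLAG_DL_OVERRUN)

/* Setter side: store only the deadline-relevant bits in dl_se->flags. */
static uint64_t setparam_dl_flags(uint64_t attr_flags)
{
    return attr_flags & DEMO_DL_FLAGS;
}

/*
 * Getter side: replace only the deadline-relevant bits of the reported
 * flags, so bits owned elsewhere (such as reset-on-fork) are preserved
 * instead of being overwritten by a stale copy.
 */
static uint64_t getparam_dl_flags(uint64_t attr_flags, uint64_t dl_flags)
{
    attr_flags &= ~DEMO_DL_FLAGS;
    attr_flags |= dl_flags & DEMO_DL_FLAGS;
    return attr_flags;
}
```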
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210727101103.2729607-2-qperret@google.com
Bug: 190237315
(cherry picked from commit f95091536f
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core)
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I251a433e0ddde6b63881f92821bc0d47c1693a02
The UCLAMP_FLAG_IDLE flag is set on a runqueue when dequeueing the last
uclamp active task (that is, when buckets.tasks reaches 0 for all
buckets) to maintain the last uclamp.max and prevent blocked util from
suddenly becoming visible.
However, there is an asymmetry in how the flag is set and cleared which
can lead to having the flag set whilst there are active tasks on the rq.
Specifically, the flag is cleared in the uclamp_rq_inc() path, which is
called at enqueue time, but set in uclamp_rq_dec_id() which is called
both when dequeueing a task _and_ in the update_uclamp_active() path. As
a result, when both uclamp_rq_{dec,inc}_id() are called from
update_uclamp_active(), the flag ends up being set but not cleared,
hence leaving the runqueue in a broken state.
Fix this by clearing the flag in update_uclamp_active() as well.
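The asymmetry can be reproduced with a heavily simplified model. The
rq_model structure and all function names below are hypothetical, a
single task counter stands in for the per-bucket counts, and the
with_fix parameter exists only to show both behaviours side by side:

```c
#include <stdbool.h>

/* Hypothetical runqueue model: one active-task counter and the flag. */
struct rq_model {
    int active_tasks;
    bool idle_flag;
};

/* Increment helper: accounts the task but does not touch the flag. */
static void rq_inc_id(struct rq_model *rq)
{
    rq->active_tasks++;
}

/* Decrement helper: sets the flag when the last active task leaves. */
static void rq_dec_id(struct rq_model *rq)
{
    if (--rq->active_tasks == 0)
        rq->idle_flag = true;
}

/* Enqueue path: accounts the task and clears the flag. */
static void rq_enqueue(struct rq_model *rq)
{
    rq_inc_id(rq);
    rq->idle_flag = false;
}

/* Refresh path: dec followed by inc for a task that stays runnable. */
static void update_active(struct rq_model *rq, bool with_fix)
{
    rq_dec_id(rq);
    rq_inc_id(rq);
    if (with_fix && rq->idle_flag)
        rq->idle_flag = false; /* the clearing added by the fix */
}
```

With a single runnable task, the refresh path without the clearing step
leaves the flag set while active_tasks is 1, which is the broken state
described above.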
Fixes: e496187da7 ("sched/uclamp: Enforce last task's UCLAMP_MAX")
Reported-by: Rick Yiu <rickyiu@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20210805102154.590709-2-qperret@google.com
[ qperret: BACKPORT due to trivial cherry-pick conflict caused by
0213b7083e ("sched/uclamp: Fix uclamp_tg_restrict()") missing
from 5.10. ]
Bug: 192559209
(cherry picked from commit ca4984a7dd
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core)
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I7b3418e553ba0f06dd5ef6f0d38a99c3210ae897
Add a new helper function and export it so that vendor modules can
dynamically switch to an alternative half-life at runtime.
Bug: 195474490
Signed-off-by: JianMin Liu <jian-min.liu@mediatek.com>
Change-Id: Ife41997a032fe3384cfa126cbf7aee929c5c11cf
This effectively locks down OWNERS approval to a small group to guard
the code base against unintentional breakages.
Bug: 194314089
Signed-off-by: Matthias Maennich <maennich@google.com>
Change-Id: Ifd1ea97639a622320ea83f901f6451e2e52b38d4
Allow vendors to obtain a list of the modules loaded at a given time. Vendor
modules are able to register on the module notifier chain
(register_module_notifier), but a vendor module would never see modules
which are loaded before the one which registers on the notifier chain.
The kernel doesn't offer load order control, so a hook is necessary to
iterate through currently loaded kernel modules.
Bug: 193552324
Change-Id: I3b01cc1b90f8c0c7c21a37992cc7d607316efc7b
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
This change introduces a prctl that allows the user program to control
which PAC keys are enabled in a particular task. The main reason
why this is useful is to enable a userspace ABI that uses PAC to
sign and authenticate function pointers and other pointers exposed
outside of the function, while still allowing binaries conforming
to the ABI to interoperate with legacy binaries that do not sign or
authenticate pointers.
The idea is that a dynamic loader or early startup code would issue
this prctl very early after establishing that a process may load legacy
binaries, but before executing any PAC instructions.
This change adds a small amount of overhead to kernel entry and exit
due to additional required instruction sequences.
On a DragonBoard 845c (Cortex-A75) with the powersave governor, the
overhead of similar instruction sequences was measured as 4.9ns when
simulating the common case where IA is left enabled, or 43.7ns when
simulating the uncommon case where IA is disabled. These numbers can
be seen as the worst case scenario, since in more realistic scenarios
a better performing governor would be used and a newer chip would be
used that would support PAC unlike Cortex-A75 and would be expected
to be faster than Cortex-A75.
On an Apple M1 under a hypervisor, the overhead of the entry/exit
instruction sequences introduced by this patch was measured as 0.3ns
in the case where IA is left enabled, and 33.0ns in the case where
IA is disabled.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Link: https://linux-review.googlesource.com/id/Ibc41a5e6a76b275efbaa126b31119dc197b927a5
Link: https://lore.kernel.org/r/d6609065f8f40397a4124654eb68c9f490b4d477.1616123271.git.pcc@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Bug: 192536783
(cherry picked from commit 201698626f)
Change-Id: Ic0a21c92a22575f9ec3599fb67bd2931a50b9f04
[quic_eberman@quicinc.com: Resolved merge conflict in
arch/arm64/kernel/process.c]
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Peter Collingbourne <pcc@google.com>
Psi polling mechanism is trying to minimize the number of wakeups to
run psi_poll_work and is currently relying on timer_pending() to detect
when this work is already scheduled. This provides a window of opportunity
for psi_group_change to schedule an immediate psi_poll_work after
poll_timer_fn got called but before psi_poll_work could reschedule itself.
Below is the depiction of this entire window:
poll_timer_fn
  wake_up_interruptible(&group->poll_wait);
psi_poll_worker
  wait_event_interruptible(group->poll_wait, ...)
  psi_poll_work
    psi_schedule_poll_work
      if (timer_pending(&group->poll_timer)) return;
      ...
      mod_timer(&group->poll_timer, jiffies + delay);
Prior to 461daba06b we used to rely on poll_scheduled atomic which was
reset and set back inside psi_poll_work and therefore this race window
was much smaller.
The larger window causes an increased number of wakeups, and our partners
report a visible power regression of ~10mA after applying 461daba06b.
Bring back the poll_scheduled atomic and make this race window even
narrower by resetting poll_scheduled only when we reach polling expiration
time. This does not completely eliminate the possibility of extra wakeups
caused by a race with psi_group_change; however, it limits them to the
worst case scenario of one extra wakeup per tracking window (0.5s
in the worst case).
This patch also ensures correct ordering between clearing poll_scheduled
flag and obtaining changed_states using memory barrier. Correct ordering
between updating changed_states and setting poll_scheduled is ensured by
atomic_xchg operation.
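The scheduling guard can be sketched in userspace with C11 atomics. This
is an analogy rather than the kernel implementation: atomic_exchange
stands in for atomic_xchg, and both function names are hypothetical:

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Try to claim the right to schedule the poll work. atomic_exchange
 * returns the previous value, so only the caller that flips the flag
 * from 0 to 1 gets true; everyone else sees 1 and backs off.
 */
static bool claim_poll_schedule(atomic_int *poll_scheduled)
{
    return atomic_exchange(poll_scheduled, 1) == 0;
}

/*
 * Called when the polling window has expired. Clearing the flag here,
 * and only here, is what keeps the race window narrow. The default
 * sequentially consistent store orders it against later loads.
 */
static void polling_window_expired(atomic_int *poll_scheduled)
{
    atomic_store(poll_scheduled, 0);
}
```

Repeated claim attempts while the flag is set return false, which
corresponds to the "blocked" reschedules counted in the measurements.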
By tracing the number of immediate rescheduling attempts performed by
psi_group_change and the number of these attempts being blocked due to
psi monitor being already active, we can assess the effects of this change:
Before the patch:
                                            Run#1    Run#2    Run#3
Immediate reschedules attempted:           684365  1385156  1261240
Immediate reschedules blocked:             682846  1381654  1258682
Immediate reschedules (delta):               1519     3502     2558
Immediate reschedules (% of attempted):     0.22%    0.25%    0.20%
After the patch:
                                            Run#1    Run#2    Run#3
Immediate reschedules attempted:           882244   770298   426218
Immediate reschedules blocked:             881996   769796   426074
Immediate reschedules (delta):                248      502      144
Immediate reschedules (% of attempted):     0.03%    0.07%    0.03%
The number of non-blocked immediate reschedules dropped from 0.22-0.25%
to 0.03-0.07%. The drop is attributed to the decrease in the race window
size and the fact that we allow this race only when psi monitors reach
polling window expiration time.
Fixes: 461daba06b ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
Reported-by: Kathleen Chang <yt.chang@mediatek.com>
Reported-by: Wenju Xu <wenju.xu@mediatek.com>
Reported-by: Jonathan Chen <jonathan.jmchen@mediatek.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: SH Chen <show-hong.chen@mediatek.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/patchwork/patch/1455172/
Bug: 191127654
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ie61547ca043e702442a9c6db1468cfb60ff2e729
We need to dump task->stack from a kernel module for debugging.
The module calls try_get_task_stack() to pin task->stack, and
try_get_task_stack()/put_task_stack() must be called in pairs,
but put_task_stack() is not exported.
Bug: 192990535
Change-Id: Ifb2f3d16f93039bffeb3e822bc066e42e2d21d13
Signed-off-by: chunhui.li <chunhui.li@mediatek.com>
Export cgroup_add_legacy_cftypes and a helper function to allow vendor
modules to expose additional files in the memory cgroup hierarchy.
Bug: 192052083
Signed-off-by: Liujie Xie <xieliujie@oppo.com>
Change-Id: Ie2b936b3e77c7ab6d740d1bb6d70e03c70a326a7
This vendor hook gives us a point at which to check the
currently running task and validate its credentials
and bpf operations.
Bug: 191291287
Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Change-Id: Ie4ed8df7ad66df2486fc7e52a26d9191fc0c176e
android_rvh_sched_fork() and android_rvh_sched_fork_init()
already let us register probes during fork(), but those are
invoked *before* the new task is added to the tasklist, which
can lead to some undesired races when a module is trying to
initialize vendor-specific task_struct fields.
Export the task_newtask tracepoint to register probes to run
during fork() but *after* the task has been inserted into the
tasklist.
Bug: 192873984
Signed-off-by: Jing-Ting Wu <Jing-Ting.Wu@mediatek.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Change-Id: Ifef14819264385b5e955a5966b4e4f66d50da5e3
Fix warnings reported by kernelci due to incorrect indentation:
kernel/smp.c:982:3: warning: this ‘if’ clause does not guard
Fixes: f0b280c395 ("ANDROID: cpuidle: Update cpuidle_uninstall_idle_handler()
to wakeup all online CPUs")
Signed-off-by: Todd Kjos <tkjos@google.com>
Change-Id: Ide771342558de321154696f9fe1272750a773853
commit 5f89468e2f upstream.
When a driver wants to sync part of a range with an offset,
swiotlb_tbl_sync_single() copies from the orig_addr base to tlb_addr with
the offset and ends up with a data mismatch.
The offset handling was removed by
"swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single",
but that logic has to be added back in.
From Linus's email:
"That commit which the removed the offset calculation entirely, because the old
(unsigned long)tlb_addr & (IO_TLB_SIZE - 1)
was wrong, but instead of removing it, I think it should have just
fixed it to be
(tlb_addr - mem->start) & (IO_TLB_SIZE - 1);
instead. That way the slot offset always matches the slot index calculation."
(Unfortunately that broke NVMe).
The use-case that drivers are hitting is as follows:
1. Get dma_addr_t from dma_map_single()
dma_addr_t tlb_addr = dma_map_single(dev, vaddr, vsize, DMA_TO_DEVICE);
    |<---------------vsize------------->|
    +-----------------------------------+
    |                                   | original buffer
    +-----------------------------------+
  vaddr
 swiotlb_align_offset
     |<----->|<---------------vsize------------->|
     +-------+-----------------------------------+
     |       |                                   | swiotlb buffer
     +-------+-----------------------------------+
          tlb_addr
2. Do something
3. Sync dma_addr_t through dma_sync_single_for_device(..)
dma_sync_single_for_device(dev, tlb_addr + offset, size, DMA_TO_DEVICE);
Error case.
Copy data to original buffer but it is from base addr (instead of
base addr + offset) in original buffer:
 swiotlb_align_offset
     |<----->|<- offset ->|<- size ->|
     +-------+-----------------------------------+
     |       |            |##########|           | swiotlb buffer
     +-------+-----------------------------------+
          tlb_addr
     |<- size ->|
     +-----------------------------------+
     |##########|                        | original buffer
     +-----------------------------------+
   vaddr
The fix is to copy the data to the original buffer and take into
account the offset, like so:
 swiotlb_align_offset
     |<----->|<- offset ->|<- size ->|
     +-------+-----------------------------------+
     |       |            |##########|           | swiotlb buffer
     +-------+-----------------------------------+
          tlb_addr
     |<- offset ->|<- size ->|
     +-----------------------------------+
     |            |##########|           | original buffer
     +-----------------------------------+
   vaddr
[One fix, Linus's, which made more sense as it created a
symmetry, would break NVMe. The reason is that:
unsigned int offset = (tlb_addr - mem->start) & (IO_TLB_SIZE - 1);
would come up with the proper offset, but it would lose the
alignment (which this patch contains).]
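The difference between the error case and the fix can be modelled with
plain buffers. This sketch ignores swiotlb_align_offset and the slot
bookkeeping entirely, and sync_to_original() is a hypothetical helper
rather than a kernel function:

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the synced window back to the original buffer. The caller syncs
 * 'size' bytes starting 'offset' bytes into the mapping, so the same
 * offset has to be applied on the original-buffer side. Copying to
 * 'orig' without the offset reproduces the error case, where the data
 * lands at the start of the original buffer.
 */
static void sync_to_original(unsigned char *orig,
                             const unsigned char *bounce,
                             size_t offset, size_t size)
{
    memcpy(orig + offset, bounce + offset, size);
}
```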
Bug: 192521392
Fixes: 16fc3cef33 ("swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single")
Signed-off-by: Bumyong Lee <bumyong.lee@samsung.com>
Signed-off-by: Chanho Park <chanho61.park@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
Reported-by: Horia Geantă <horia.geanta@nxp.com>
Tested-by: Horia Geantă <horia.geanta@nxp.com>
CC: stable@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit e6108147dd of
linux-5.10.47)
Change-Id: Ib03e81080ab029d37e6ff54a3e2cb526d3a30e10
wake_up_all_idle_cpus() will not wake up paused CPUs since they are removed
from cpu_active_mask, but paused CPUs can be in deep cpu idle and hence must
be woken up when uninstalling the idle handler.
This change fixes this by introducing wake_up_all_online_idle_cpus() to
unconditionally wake up all online idle CPUs and invoking it when
uninstalling the cpu idle handler.
Bug: 192436062
Fixes: 683010f555 ("ANDROID: cpu/hotplug: add pause/resume_cpus interface")
Change-Id: I4afd4b7a17b87f9cc495e7009c9537888387f9ef
Signed-off-by: Maulik Shah <mkshah@codeaurora.org>
For vendor specific data in struct cfs_rq.
Bug: 188947181
Signed-off-by: Rick Yiu <rickyiu@google.com>
Change-Id: I7c322c6812829c19014426b5721cd1fb0c37a53f
As restricted hooks have been introduced, regular vendor hooks are no
longer necessary.
Bug: 187917024
Change-Id: Ia70e9dd1bd7373e19bdc82e90a2384201076bc0b
Signed-off-by: Shaleen Agrawal <shalagra@codeaurora.org>
select_fallback_rq() must return a cpu that is valid for the task.
However, when nid is not -1, it skips checking for
task_cpu_possible_mask().
This causes a problem when execve-ing 32 bit apps on an asymmetric
system where not all cpus are 32 bit capable. During execve-ing
the task is marked as 32 bit long before its affinity mask is
restricted.
If the cpu goes offline during this time, select_fallback_rq()
could return a 64 bit only cpu, which __migrate_tasks()/
is_cpu_allowed() rejects.
migrate_tasks() will therefore continue to pick the same task
repeatedly, where __migrate_tasks() rejects the cpu chosen
by select_fallback_rq() every time, leading to an infinite loop.
Correct the issue by updating select_fallback_rq() for the case
where nid is not -1, ensuring that the returned cpu is always
valid for this task.
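A bitmask model of the node-local branch illustrates the missing check.
The function name and the 32-CPU limit are assumptions for the sketch,
and the real function has further fallback stages that are not modelled:

```c
#include <stdint.h>

/*
 * Pick the lowest-numbered CPU that is in the task's node, is active,
 * and is one the task can actually run on. Returns -1 if there is no
 * such CPU, in which case a caller would move on to other fallbacks.
 */
static int pick_fallback_cpu(uint32_t node_cpus, uint32_t active_cpus,
                             uint32_t task_possible_cpus)
{
    /* Leaving task_possible_cpus out of this mask is the bug class
     * described above: a CPU the task cannot use could be returned. */
    uint32_t candidates = node_cpus & active_cpus & task_possible_cpus;

    for (int cpu = 0; cpu < 32; cpu++) {
        if (candidates & (UINT32_C(1) << cpu))
            return cpu;
    }
    return -1;
}
```

For example, with CPUs 0-3 in the node, CPU 0 offline and only CPUs 2-3
capable of running the 32 bit task, the model skips CPU 1 and picks
CPU 2.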
Bug: 192050156
Change-Id: Ia073a8395a02485f6d1c1daa0f3ce9e2029cb1f4
Signed-off-by: Stephen Dickey <dickey@codeaurora.org>
In order to update cpufreq, vendor modules invoke cpufreq_update_util(),
but building those modules fails with the error:
ERROR: modpost: "cpufreq_update_util_data" [xxx.ko] undefined!
Bug: 192218676
Signed-off-by: Liujie Xie <xieliujie@oppo.com>
Change-Id: Ib1da70229f04b08d8d812d065021dec0bf891e0e
Pre and post tracepoints in force_compatible_cpus_allowed_ptr() need
to be restricted hooks so that they can sleep.
The old non-restricted versions need to stay in place temporarily for
KMI stability. They will be removed by aosp/1742588.
Bug: 187917024
Change-Id: If630554b1c8fa2e8ccb79c89945c55e17756e6a8
Signed-off-by: Shaleen Agrawal <shalagra@codeaurora.org>
PSI accounts stalls for each cgroup separately and aggregates it at each
level of the hierarchy. This causes additional overhead with psi_avgs_work
being called for each cgroup in the hierarchy. psi_avgs_work has been
highly optimized, however on systems with large number of cgroups the
overhead becomes noticeable.
Systems which use PSI only at the system level could avoid this overhead
if PSI can be configured to skip per-cgroup stall accounting.
Add "cgroup_disable=pressure" kernel command-line option to allow
requesting system-wide only pressure stall accounting. When set, it
keeps system-wide accounting under /proc/pressure/ but skips accounting
for individual cgroups and does not expose PSI nodes in cgroup hierarchy.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/patchwork/patch/1435705
(cherry picked from commit 3958e2d0c3
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git tj)
Bug: 178872719
Bug: 191734423
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ifc8fbc52f9a1131d7c2668edbb44c525c76c3360
Some interrupts (such as the rescheduling IPI) rely on not going through
the irq_enter()/irq_exit() calls. To distinguish such interrupts, add
a new IRQ flag that allows the low-level handling code to sidestep the
enter()/exit() calls.
Only the architecture code is expected to use this. It will do the wrong
thing on normal interrupts. Note that this is a band-aid until we can
move to some more correct infrastructure (such as kernel/entry/common.c).
Bug: 191808738
Link: https://lore.kernel.org/lkml/20201124141449.572446-3-maz@kernel.org/
Change-Id: I0609a8b689219ba9e769c8b9f7fcf1e77a0ff1ca
Signed-off-by: Marc Zyngier <maz@kernel.org>
[minor port to 5.10]
Signed-off-by: Stephen Dickey <dickey@codeaurora.org>
Some arch-specific flags need to be set/cleared, but not exposed to
random device drivers. Introduce a new helper (__irq_modify_status())
that takes an arbitrary mask, and rewrite irq_modify_status() to use
this new helper.
No functional change.
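The shape of the refactoring can be sketched as a masked helper plus a
thin wrapper. The names, the mask value and the plain status word below
are hypothetical stand-ins, not the kernel's definitions:

```c
/* Hypothetical mask of the status bits that drivers may change. */
#define DEMO_PUBLIC_MODIFY_MASK 0x0fUL

/* Inner helper: the caller chooses which bits may be touched. */
static void modify_status_masked(unsigned long *status, unsigned long clr,
                                 unsigned long set, unsigned long mask)
{
    *status &= ~(clr & mask);
    *status |= (set & mask);
}

/* Driver-facing wrapper: same behaviour as before, fixed public mask. */
static void modify_status(unsigned long *status, unsigned long clr,
                          unsigned long set)
{
    modify_status_masked(status, clr, set, DEMO_PUBLIC_MODIFY_MASK);
}
```

Architecture code could then call the inner helper with a wider mask,
while drivers going through the wrapper remain unable to set or clear
bits outside the public mask.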
Bug: 191808738
Link: https://lore.kernel.org/lkml/20201124141449.572446-5-maz@kernel.org/
Change-Id: I2c2c0d6599d0ab39fad22462bf4c87694362fba8
Signed-off-by: Marc Zyngier <maz@kernel.org>
[minor port to 5.10]
Signed-off-by: Stephen Dickey <dickey@codeaurora.org>
Add vendor hooks to qos.c to handle some special cases related to
our feature. The hooks are added at freq_qos_add_request and
freq_qos_remove_request so that we can divert to our own qos
processing logic.
Bug: 187458531
Signed-off-by: heshuai1 <heshuai1@xiaomi.com>
Change-Id: I1fb8fd6134432ecfb44ad242c66ccd8280ab9b43
Proactive compaction[1] is triggered every 500msec and runs
compaction on the node for COMPACTION_HPAGE_ORDER (usually order-9)
pages based on the value set in sysctl.compaction_proactiveness.
Triggering compaction every 500msec in search of
COMPACTION_HPAGE_ORDER pages is not needed for all applications,
especially in embedded system use cases which may have only a few MBs of
RAM. Enabling proactive compaction in its current state will end up
running it almost always on such systems.
On the other hand, proactive compaction can still be very useful for
getting a set of higher-order pages in a controllable
manner (controlled through sysctl.compaction_proactiveness). Thus, on
systems where always-on proactive compaction may prove unnecessary,
it can instead be triggered from user space by a write to its sysctl
interface. As an example, say an app launcher decides to launch a
memory-heavy application, which can be launched fast if it gets more
higher-order pages; the launcher can prepare the system in advance by
triggering proactive compaction from userspace.
This triggering of proactive compaction is done on a write to
sysctl.compaction_proactiveness by the user.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=facdaa917c4d5a376d09d25865f5a863f906234a
Bug: 186387247
Link: https://lore.kernel.org/patchwork/patch/1438211/
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Change-Id: Ie5208e274b9d7e7354471bb98ff1f10becf93595
Changes in 5.10.43
btrfs: tree-checker: do not error out if extent ref hash doesn't match
net: usb: cdc_ncm: don't spew notifications
hwmon: (dell-smm-hwmon) Fix index values
hwmon: (pmbus/isl68137) remove READ_TEMPERATURE_3 for RAA228228
netfilter: conntrack: unregister ipv4 sockopts on error unwind
efi/fdt: fix panic when no valid fdt found
efi: Allow EFI_MEMORY_XP and EFI_MEMORY_RO both to be cleared
efi/libstub: prevent read overflow in find_file_option()
efi: cper: fix snprintf() use in cper_dimm_err_location()
vfio/pci: Fix error return code in vfio_ecap_init()
vfio/pci: zap_vma_ptes() needs MMU
samples: vfio-mdev: fix error handing in mdpy_fb_probe()
vfio/platform: fix module_put call in error flow
ipvs: ignore IP_VS_SVC_F_HASHED flag when adding service
HID: logitech-hidpp: initialize level variable
HID: pidff: fix error return code in hid_pidff_init()
HID: i2c-hid: fix format string mismatch
devlink: Correct VIRTUAL port to not have phys_port attributes
net/sched: act_ct: Offload connections with commit action
net/sched: act_ct: Fix ct template allocation for zone 0
mptcp: always parse mptcp options for MPC reqsk
nvme-rdma: fix in-casule data send for chained sgls
ACPICA: Clean up context mutex during object deletion
perf probe: Fix NULL pointer dereference in convert_variable_location()
net: dsa: tag_8021q: fix the VLAN IDs used for encoding sub-VLANs
net: sock: fix in-kernel mark setting
net/tls: Replace TLS_RX_SYNC_RUNNING with RCU
net/tls: Fix use-after-free after the TLS device goes down and up
net/mlx5e: Fix incompatible casting
net/mlx5: Check firmware sync reset requested is set before trying to abort it
net/mlx5e: Check for needed capability for cvlan matching
net/mlx5: DR, Create multi-destination flow table with level less than 64
nvmet: fix freeing unallocated p2pmem
netfilter: nft_ct: skip expectations for confirmed conntrack
netfilter: nfnetlink_cthelper: hit EBUSY on updates if size mismatches
drm/i915/selftests: Fix return value check in live_breadcrumbs_smoketest()
bpf: Simplify cases in bpf_base_func_proto
bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks
ieee802154: fix error return code in ieee802154_add_iface()
ieee802154: fix error return code in ieee802154_llsec_getparams()
igb: add correct exception tracing for XDP
ixgbevf: add correct exception tracing for XDP
cxgb4: fix regression with HASH tc prio value update
ipv6: Fix KASAN: slab-out-of-bounds Read in fib6_nh_flush_exceptions
ice: Fix allowing VF to request more/less queues via virtchnl
ice: Fix VFR issues for AVF drivers that expect ATQLEN cleared
ice: handle the VF VSI rebuild failure
ice: report supported and advertised autoneg using PHY capabilities
ice: Allow all LLDP packets from PF to Tx
i2c: qcom-geni: Add shutdown callback for i2c
cxgb4: avoid link re-train during TC-MQPRIO configuration
i40e: optimize for XDP_REDIRECT in xsk path
i40e: add correct exception tracing for XDP
ice: simplify ice_run_xdp
ice: optimize for XDP_REDIRECT in xsk path
ice: add correct exception tracing for XDP
ixgbe: optimize for XDP_REDIRECT in xsk path
ixgbe: add correct exception tracing for XDP
arm64: dts: ti: j7200-main: Mark Main NAVSS as dma-coherent
optee: use export_uuid() to copy client UUID
bus: ti-sysc: Fix am335x resume hang for usb otg module
arm64: dts: ls1028a: fix memory node
arm64: dts: zii-ultra: fix 12V_MAIN voltage
arm64: dts: freescale: sl28: var4: fix RGMII clock and voltage
ARM: dts: imx7d-meerkat96: Fix the 'tuning-step' property
ARM: dts: imx7d-pico: Fix the 'tuning-step' property
ARM: dts: imx: emcon-avari: Fix nxp,pca8574 #gpio-cells
bus: ti-sysc: Fix flakey idling of uarts and stop using swsup_sidle_act
tipc: add extack messages for bearer/media failure
tipc: fix unique bearer names sanity check
serial: stm32: fix threaded interrupt handling
riscv: vdso: fix and clean-up Makefile
io_uring: fix link timeout refs
io_uring: use better types for cflags
drm/amdgpu/vcn3: add cancel_delayed_work_sync before power gate
drm/amdgpu/jpeg2.5: add cancel_delayed_work_sync before power gate
drm/amdgpu/jpeg3: add cancel_delayed_work_sync before power gate
Bluetooth: fix the erroneous flush_work() order
Bluetooth: use correct lock to prevent UAF of hdev object
wireguard: do not use -O3
wireguard: peer: allocate in kmem_cache
wireguard: use synchronize_net rather than synchronize_rcu
wireguard: selftests: remove old conntrack kconfig value
wireguard: selftests: make sure rp_filter is disabled on vethc
wireguard: allowedips: initialize list head in selftest
wireguard: allowedips: remove nodes in O(1)
wireguard: allowedips: allocate nodes in kmem_cache
wireguard: allowedips: free empty intermediate nodes when removing single node
net: caif: added cfserl_release function
net: caif: add proper error handling
net: caif: fix memory leak in caif_device_notify
net: caif: fix memory leak in cfusbl_device_notify
HID: i2c-hid: Skip ELAN power-on command after reset
HID: magicmouse: fix NULL-deref on disconnect
HID: multitouch: require Finger field to mark Win8 reports as MT
gfs2: fix scheduling while atomic bug in glocks
ALSA: timer: Fix master timer notification
ALSA: hda: Fix for mute key LED for HP Pavilion 15-CK0xx
ALSA: hda: update the power_state during the direct-complete
ARM: dts: imx6dl-yapp4: Fix RGMII connection to QCA8334 switch
ARM: dts: imx6q-dhcom: Add PU,VDD1P1,VDD2P5 regulators
ext4: fix memory leak in ext4_fill_super
ext4: fix bug on in ext4_es_cache_extent as ext4_split_extent_at failed
ext4: fix fast commit alignment issues
ext4: fix memory leak in ext4_mb_init_backend on error path.
ext4: fix accessing uninit percpu counter variable with fast_commit
usb: dwc2: Fix build in periphal-only mode
pid: take a reference when initializing `cad_pid`
ocfs2: fix data corruption by fallocate
mm/debug_vm_pgtable: fix alignment for pmd/pud_advanced_tests()
mm/page_alloc: fix counting of free pages after take off from buddy
x86/cpufeatures: Force disable X86_FEATURE_ENQCMD and remove update_pasid()
x86/sev: Check SME/SEV support in CPUID first
nfc: fix NULL ptr dereference in llcp_sock_getname() after failed connect
drm/amdgpu: Don't query CE and UE errors
drm/amdgpu: make sure we unpin the UVD BO
x86/apic: Mark _all_ legacy interrupts when IO/APIC is missing
powerpc/kprobes: Fix validation of prefixed instructions across page boundary
btrfs: mark ordered extent and inode with error if we fail to finish
btrfs: fix error handling in btrfs_del_csums
btrfs: return errors from btrfs_del_csums in cleanup_ref_head
btrfs: fixup error handling in fixup_inode_link_counts
btrfs: abort in rename_exchange if we fail to insert the second ref
btrfs: fix deadlock when cloning inline extents and low on available space
mm, hugetlb: fix simple resv_huge_pages underflow on UFFDIO_COPY
drm/msm/dpu: always use mdp device to scale bandwidth
btrfs: fix unmountable seed device after fstrim
KVM: SVM: Truncate GPR value for DR and CR accesses in !64-bit mode
KVM: arm64: Fix debug register indexing
x86/kvm: Teardown PV features on boot CPU as well
x86/kvm: Disable kvmclock on all CPUs on shutdown
x86/kvm: Disable all PV features on crash
lib/lz4: explicitly support in-place decompression
i2c: qcom-geni: Suspend and resume the bus during SYSTEM_SLEEP_PM ops
netfilter: nf_tables: missing error reporting for not selected expressions
xen-netback: take a reference to the RX task thread
neighbour: allow NUD_NOARP entries to be forced GCed
Linux 5.10.43
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I8d7ec0878193e4e454076809b7fb71fcc4e3d810
Export the symbol cpuset_cpus_allowed() so that kernel modules can perform
cpuset operations in vendor hook related code.
Bug: 189725786
Signed-off-by: lijianzhong <lijianzhong@xiaomi.com>
Change-Id: I7919a893ab64bb441ab43cbb0b16825ed76d802d
[ Upstream commit ff40e51043 ]
Commit 59438b4647 ("security,lockdown,selinux: implement SELinux lockdown")
added an implementation of the locked_down LSM hook to SELinux, with the aim
to restrict which domains are allowed to perform operations that would breach
lockdown. This also indirectly involves the audit subsystem, which reports the
resulting events. The latter is problematic, as reported by Ondrej and Serhei,
since it can bring down the whole system via audit:
1) The audit events that are triggered due to calls to security_locked_down()
can OOM kill a machine, see below details [0].
2) It also seems to be causing a deadlock via avc_has_perm()/slow_avc_audit()
when trying to wake up kauditd, for example, when using trace_sched_switch()
tracepoint, see details in [1]. Triggering this was not via some hypothetical
corner case, but with existing tools like runqlat & runqslower from bcc, for
example, which make use of this tracepoint. Rough call sequence goes like:
  rq_lock(rq) -> -------------------------+
    trace_sched_switch() ->               |
      bpf_prog_xyz() ->                   +-> deadlock
        selinux_lockdown() ->             |
          audit_log_end() ->              |
            wake_up_interruptible() ->    |
              try_to_wake_up() ->         |
                rq_lock(rq) --------------+
What's worse is that the intention of 59438b4647 to further restrict lockdown
settings for specific applications in respect to the global lockdown policy is
completely broken for BPF. The SELinux policy rule for the current lockdown check
looks something like this:
allow <who> <who> : lockdown { <reason> };
However, this doesn't match the 'current' task in whose context security_locked_down()
is executed. For example: httpd does a syscall. There is a tracing program attached
to the syscall which triggers a BPF program to run, which ends up doing a
bpf_probe_read_kernel{,_str}() helper call. The selinux_lockdown() hook does
the permission check against 'current', that is, httpd in this example. httpd
has literally zero relation to this tracing program, and it would be nonsensical
having to write an SELinux policy rule against httpd to let the tracing helper
pass. The policy in this case needs to be against the entity that is installing
the BPF program. For example, if bpftrace would generate a histogram of syscall
counts by user space application:
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
bpftrace would then go and generate a BPF program from this internally. One way
of doing it [for the sake of the example] could be to call bpf_get_current_task()
helper and then access current->comm via one of bpf_probe_read_kernel{,_str}()
helpers. So the program itself has nothing to do with httpd or any other random
app doing a syscall here. The BPF program _explicitly initiated_ the lockdown
check. The allow/deny policy belongs in the context of bpftrace: meaning, you
want to grant bpftrace access to use these helpers, but other tracers on the
system like my_random_tracer _not_.
Therefore fix all three issues at the same time by taking a completely different
approach for the security_locked_down() hook, that is, move the check into the
program verification phase where we actually retrieve the BPF func proto. This
also reliably gets the task (current) that is trying to install the BPF tracing
program, e.g. bpftrace/bcc/perf/systemtap/etc, and it also fixes the OOM since
we're moving this out of the BPF helper's fast-path which can be called several
millions of times per second.
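
The effect of moving the check can be illustrated with a small userspace model.
This is only a sketch under stated assumptions: locked_down_model(),
resolve_helper_model(), probe_read_model() and run_program_model() are
hypothetical stand-ins and not kernel functions. It shows the policy hook being
consulted once when the helper is resolved at load time, rather than on every
helper call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names are kernel APIs. */
static unsigned long lockdown_checks;	/* how often the policy hook ran */
static bool policy_denies;		/* simulated lockdown policy decision */

/* Simulated policy hook: counts invocations, returns nonzero when denied. */
static int locked_down_model(void)
{
	lockdown_checks++;
	return policy_denies ? -1 : 0;
}

typedef long (*helper_fn)(void *dst, const void *src, size_t len);

/* The helper itself: no policy check on the fast path. */
static long probe_read_model(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	for (size_t i = 0; i < len; i++)
		d[i] = s[i];
	return 0;
}

/*
 * Load-time resolution: the policy is consulted once, in the context of
 * the task installing the program. NULL means the helper is unavailable.
 */
static helper_fn resolve_helper_model(void)
{
	return locked_down_model() < 0 ? NULL : probe_read_model;
}

/* Run the "program": call the resolved helper n times. */
static unsigned long run_program_model(helper_fn fn, unsigned long n)
{
	int src = 42, dst = 0;

	assert(fn != NULL);	/* a rejected program never gets to run */
	for (unsigned long i = 0; i < n; i++)
		fn(&dst, &src, sizeof(src));
	return n;
}
```

In this model, running the program a thousand times leaves the check counter at
one, whereas a check inside the helper would scale with the number of calls.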
The check is then also in line with other security_locked_down() hooks in the
system where the enforcement is performed at open/load time, for example,
open_kcore() for /proc/kcore access or module_sig_check() for module signatures
just to pick a few random ones. What's out of scope in the fix as well as in
other security_locked_down() hook locations /outside/ of BPF subsystem is that
if the lockdown policy changes on the fly there is no retrospective action.
This requires a different discussion, potentially complex infrastructure, and
it's also not clear whether this can be solved generically. Either way, it is
out of scope for a suitable stable fix which this one is targeting. Note that
the breakage is specifically on 59438b4647 where it started to rely on 'current'
as UAPI behavior, and _not_ earlier infrastructure such as 9d1f8be5cf ("bpf:
Restrict bpf when kernel lockdown is in confidentiality mode").
[0] https://bugzilla.redhat.com/show_bug.cgi?id=1955585, Jakub Hrozek says:
I starting seeing this with F-34. When I run a container that is traced with
BPF to record the syscalls it is doing, auditd is flooded with messages like:
type=AVC msg=audit(1619784520.593:282387): avc: denied { confidentiality }
for pid=476 comm="auditd" lockdown_reason="use of bpf to read kernel RAM"
scontext=system_u:system_r:auditd_t:s0 tcontext=system_u:system_r:auditd_t:s0
tclass=lockdown permissive=0
This seems to be leading to auditd running out of space in the backlog buffer
and eventually OOMs the machine.
[...]
auditd running at 99% CPU presumably processing all the messages, eventually I get:
Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152579 > audit_backlog_limit=64
Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152626 > audit_backlog_limit=64
Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152694 > audit_backlog_limit=64
Apr 30 12:20:42 fedora kernel: audit: audit_lost=6878426 audit_rate_limit=0 audit_backlog_limit=64
Apr 30 12:20:45 fedora kernel: oci-seccomp-bpf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000
Apr 30 12:20:45 fedora kernel: CPU: 0 PID: 13284 Comm: oci-seccomp-bpf Not tainted 5.11.12-300.fc34.x86_64 #1
Apr 30 12:20:45 fedora kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
[...]
[1] https://lore.kernel.org/linux-audit/CANYvDQN7H5tVp47fbYcRasv4XF07eUbsDwT_eDCHXJUj43J7jQ@mail.gmail.com/,
Serhei Makarov says:
Upstream kernel 5.11.0-rc7 and later was found to deadlock during a
bpf_probe_read_compat() call within a sched_switch tracepoint. The problem
is reproducible with the reg_alloc3 testcase from SystemTap's BPF backend
testsuite on x86_64 as well as the runqlat, runqslower tools from bcc on
ppc64le. Example stack trace:
[...]
[ 730.868702] stack backtrace:
[ 730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted, 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1
[ 730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
[ 730.873278] Call Trace:
[ 730.873770] dump_stack+0x7f/0xa1
[ 730.874433] check_noncircular+0xdf/0x100
[ 730.875232] __lock_acquire+0x1202/0x1e10
[ 730.876031] ? __lock_acquire+0xfc0/0x1e10
[ 730.876844] lock_acquire+0xc2/0x3a0
[ 730.877551] ? __wake_up_common_lock+0x52/0x90
[ 730.878434] ? lock_acquire+0xc2/0x3a0
[ 730.879186] ? lock_is_held_type+0xa7/0x120
[ 730.880044] ? skb_queue_tail+0x1b/0x50
[ 730.880800] _raw_spin_lock_irqsave+0x4d/0x90
[ 730.881656] ? __wake_up_common_lock+0x52/0x90
[ 730.882532] __wake_up_common_lock+0x52/0x90
[ 730.883375] audit_log_end+0x5b/0x100
[ 730.884104] slow_avc_audit+0x69/0x90
[ 730.884836] avc_has_perm+0x8b/0xb0
[ 730.885532] selinux_lockdown+0xa5/0xd0
[ 730.886297] security_locked_down+0x20/0x40
[ 730.887133] bpf_probe_read_compat+0x66/0xd0
[ 730.887983] bpf_prog_250599c5469ac7b5+0x10f/0x820
[ 730.888917] trace_call_bpf+0xe9/0x240
[ 730.889672] perf_trace_run_bpf_submit+0x4d/0xc0
[ 730.890579] perf_trace_sched_switch+0x142/0x180
[ 730.891485] ? __schedule+0x6d8/0xb20
[ 730.892209] __schedule+0x6d8/0xb20
[ 730.892899] schedule+0x5b/0xc0
[ 730.893522] exit_to_user_mode_prepare+0x11d/0x240
[ 730.894457] syscall_exit_to_user_mode+0x27/0x70
[ 730.895361] entry_SYSCALL_64_after_hwframe+0x44/0xae
[...]
Fixes: 59438b4647 ("security,lockdown,selinux: implement SELinux lockdown")
Reported-by: Ondrej Mosnacek <omosnace@redhat.com>
Reported-by: Jakub Hrozek <jhrozek@redhat.com>
Reported-by: Serhei Makarov <smakarov@redhat.com>
Reported-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: James Morris <jamorris@linux.microsoft.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Frank Eigler <fche@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/bpf/01135120-8bf7-df2e-cff0-1d73f1f841c3@iogearbox.net
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 61ca36c8c4 ]
!perfmon_capable() is checked before the last switch(func_id) in
bpf_base_func_proto. Thus, the cases BPF_FUNC_trace_printk and
BPF_FUNC_snprintf_btf can be moved to that last switch(func_id) to omit
the inline !perfmon_capable() checks.
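
The resulting shape can be sketched roughly as follows. This is a simplified
model, not the actual bpf_base_func_proto(): the enum values, the proto objects
and the has_perfmon flag are hypothetical placeholders, and other capability
gates in the real function are left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; the IDs and protos below are placeholders. */
enum func_id_model {
	FUNC_UNPRIV_EXAMPLE,
	FUNC_TRACE_PRINTK,
	FUNC_SNPRINTF_BTF,
	FUNC_UNKNOWN_EXAMPLE,
};

struct func_proto_model {
	const char *name;
};

static const struct func_proto_model unpriv_proto = { "unpriv_example" };
static const struct func_proto_model trace_printk_proto = { "trace_printk" };
static const struct func_proto_model snprintf_btf_proto = { "snprintf_btf" };

static bool has_perfmon;	/* simulated result of the capability test */

static const struct func_proto_model *base_func_proto_model(enum func_id_model id)
{
	/* Helpers that need no capability are handled first. */
	switch (id) {
	case FUNC_UNPRIV_EXAMPLE:
		return &unpriv_proto;
	default:
		break;
	}

	/* A single gate covers every case in the switch below. */
	if (!has_perfmon)
		return NULL;

	switch (id) {
	case FUNC_TRACE_PRINTK:
		return &trace_printk_proto;	/* no inline check needed */
	case FUNC_SNPRINTF_BTF:
		return &snprintf_btf_proto;	/* no inline check needed */
	default:
		return NULL;
	}
}
```

Because the two cases sit behind the shared gate, their behavior is the same as
with the per-case checks, just with less duplicated code.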
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210127174615.3038-1-tklauser@distanz.ch
Signed-off-by: Sasha Levin <sashal@kernel.org>
Add a vendor hook to user.c. Because of some special cases related to our
feature, we need to initialize the variables that we define in user_struct,
so add the hook in alloc_uid to make sure our own logic runs when the
user_struct is about to be initialized.
Bug: 187458531
Signed-off-by: heshuai1 <heshuai1@xiaomi.com>
Change-Id: I078484aac2c3d396aba5971d6d0f491652f3781c
and sched_waking to let module probe them
Get task info about sleep and waking
Bug: 190422437
Signed-off-by: Liujie Xie <xieliujie@oppo.com>
Change-Id: I828c93f531f84e6133c2c3a7f8faada51683afcf
Module code would like to hold some locks while affinity is being updated
for a 32-bit task exec.
Create pre and post tracepoints in force_compatible_cpus_allowed_ptr().
Bug: 187917024
Change-Id: I95bff9f4d5b5d37c1d5440acbd6857d2855c2b43
Signed-off-by: Abhijeet Dharmapurikar <adharmap@codeaurora.org>
Signed-off-by: Shaleen Agrawal <shalagra@codeaurora.org>
Add a vendor hook to freezer.c. Because of some special cases related to our
feature, we do not want the process to be frozen immediately, so add the hook
in __refrigerator to make sure our own freeze logic runs when the process is
about to be frozen.
Bug: 187458531
Signed-off-by: heshuai1 <heshuai1@xiaomi.com>
Change-Id: Iea42fd9604d6b33ccd6502425416f0dd28eecebb
Add android_rvh_find_new_ilb so that vendors can select the next ilb
(idle load balancer) CPU.
Bug: 190228983
Change-Id: Iba1a0cd9cdc22dcf628dd33f8d838fe513a4818f
Signed-off-by: Choonghoon Park <choong.park@samsung.com>
Add ANDROID_OEM_DATA to struct rq, which is used to implement OEM
scheduler tuning.
Bug: 188899490
Change-Id: I1904b4fd83effc4b309bfb98811e9718398504f4
Signed-off-by: Liangliang Li <liliangliang@vivo.com>
With the introduction of per-cpu wakeup devices that can be used in
preference to the broadcast timer, print the name of such devices when
they are available.
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210524221818.15850-6-will@kernel.org
(cherry picked from commit 245a057fee tip/tip.git timers/core)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 185092876
Change-Id: I39736cb43702430b722382c802603fdc4188a5c4
When configuring the broadcast timer on entry to and exit from deep idle
states, prefer a per-CPU wakeup timer if one exists.
On entry to idle, stop the tick device and transfer the next event into
the oneshot wakeup device, which will serve as the wakeup from idle. To
avoid the overhead of additional hardware accesses on exit from idle,
leave the timer armed and treat the inevitable interrupt as a (possibly
spurious) tick event.
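
The idle entry and exit behavior described above can be modeled roughly like
this. It is a sketch under stated assumptions, not the kernel implementation:
struct clock_dev_model, struct cpu_timers_model and the three functions are
hypothetical names, and clock event state handling is reduced to an "armed"
flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a one-shot timer. */
struct clock_dev_model {
	bool armed;
	unsigned long long next_event;
};

struct cpu_timers_model {
	struct clock_dev_model tick;	/* stops in deep idle */
	struct clock_dev_model *wakeup;	/* per-CPU wakeup timer, may be NULL */
	bool uses_broadcast;		/* fell back to the broadcast timer */
	unsigned int tick_events;	/* tick events handled so far */
};

/* Idle entry: prefer the per-CPU wakeup timer over the broadcast timer. */
static void idle_enter_model(struct cpu_timers_model *cpu)
{
	if (!cpu->wakeup) {
		cpu->uses_broadcast = true;
		return;
	}
	cpu->tick.armed = false;			/* stop the tick device */
	cpu->wakeup->next_event = cpu->tick.next_event;	/* transfer next event */
	cpu->wakeup->armed = true;
	cpu->uses_broadcast = false;
}

/* Idle exit: the wakeup timer is deliberately left armed. */
static void idle_exit_model(struct cpu_timers_model *cpu)
{
	cpu->uses_broadcast = false;
}

/* The wakeup timer's interrupt is treated as a (possibly spurious) tick. */
static void wakeup_irq_model(struct cpu_timers_model *cpu)
{
	assert(cpu->wakeup != NULL);
	cpu->wakeup->armed = false;	/* the one-shot timer has fired */
	cpu->tick_events++;
}
```

Leaving the timer armed on exit trades one possibly spurious interrupt for not
having to touch the timer hardware again on the exit path.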
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210524221818.15850-5-will@kernel.org
(cherry picked from commit ea5c7f1b9a tip/tip.git timers/core)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 185092876
Change-Id: I62a49231e213285f95e9f0cf6a07633984930b56
Some SoCs have two per-cpu timer implementations where the timer with the
higher rating stops in deep idle (i.e. suffers from CLOCK_EVT_FEAT_C3STOP)
but is otherwise preferable to the timer with the lower rating. In such a
design, selecting the higher rated devices relies on a global broadcast
timer and IPIs to wake up from deep idle states.
To avoid the reliance on a global broadcast timer and also to reduce the
overhead associated with the IPI wakeups, extend
tick_install_broadcast_device() to manage per-cpu wakeup timers separately
from the broadcast device.
For now, these timers remain unused.
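
As a rough illustration of the two-timer design, the selection could be modeled
as below. This is a sketch only: struct timer_model and both functions are
hypothetical, and the selection criteria are a simplifying assumption (highest
rating for the tick, highest rating among timers that keep running in deep idle
for the wakeup role), not the exact checks the kernel performs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a per-CPU timer; not the kernel's clock_event_device. */
struct timer_model {
	const char *name;
	int rating;
	bool stops_in_deep_idle;	/* models CLOCK_EVT_FEAT_C3STOP */
};

/* Tick device: simply the highest rated timer. */
static const struct timer_model *pick_tick_model(const struct timer_model *t, size_t n)
{
	const struct timer_model *best = NULL;

	for (size_t i = 0; i < n; i++)
		if (!best || t[i].rating > best->rating)
			best = &t[i];
	return best;
}

/* Wakeup device: the highest rated timer that keeps running in deep idle. */
static const struct timer_model *pick_wakeup_model(const struct timer_model *t, size_t n)
{
	const struct timer_model *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (t[i].stops_in_deep_idle)
			continue;
		if (!best || t[i].rating > best->rating)
			best = &t[i];
	}
	return best;
}
```

With one high-rated timer that stops in deep idle and one lower-rated timer
that does not, the first becomes the tick device and the second the wakeup
device, so no global broadcast timer is needed for that CPU.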
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210524221818.15850-4-will@kernel.org
(cherry picked from commit c94a8537df tip/tip.git timers/core)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 185092876
Change-Id: I2d2b1bc6333d004846270d3e58dec0dca89a89d1