Merge branch 'power-supply-scope' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen
This commit is contained in:
commit
251f39fe42
10955 changed files with 604178 additions and 381478 deletions
|
@ -26,6 +26,8 @@ s2ram.txt
|
|||
- How to get suspend to ram working (and debug it when it isn't)
|
||||
states.txt
|
||||
- System power management states
|
||||
suspend-and-cpuhotplug.txt
|
||||
- Explains the interaction between Suspend-to-RAM (S3) and CPU hotplug
|
||||
swsusp-and-swap-files.txt
|
||||
- Using swap files with software suspend (to disk)
|
||||
swsusp-dmcrypt.txt
|
||||
|
|
|
@ -173,7 +173,7 @@ kernel messages using the serial console. This may provide you with some
|
|||
information about the reasons of the suspend (resume) failure. Alternatively,
|
||||
it may be possible to use a FireWire port for debugging with firescope
|
||||
(ftp://ftp.firstfloor.org/pub/ak/firescope/). On x86 it is also possible to
|
||||
use the PM_TRACE mechanism documented in Documentation/s2ram.txt .
|
||||
use the PM_TRACE mechanism documented in Documentation/power/s2ram.txt .
|
||||
|
||||
2. Testing suspend to RAM (STR)
|
||||
|
||||
|
@ -201,3 +201,27 @@ case, you may be able to search for failing drivers by following the procedure
|
|||
analogous to the one described in section 1. If you find some failing drivers,
|
||||
you will have to unload them every time before an STR transition (ie. before
|
||||
you run s2ram), and please report the problems with them.
|
||||
|
||||
There is a debugfs entry which shows the suspend to RAM statistics. Here is an
|
||||
example of its output.
|
||||
# mount -t debugfs none /sys/kernel/debug
|
||||
# cat /sys/kernel/debug/suspend_stats
|
||||
success: 20
|
||||
fail: 5
|
||||
failed_freeze: 0
|
||||
failed_prepare: 0
|
||||
failed_suspend: 5
|
||||
failed_suspend_noirq: 0
|
||||
failed_resume: 0
|
||||
failed_resume_noirq: 0
|
||||
failures:
|
||||
last_failed_dev: alarm
|
||||
adc
|
||||
last_failed_errno: -16
|
||||
-16
|
||||
last_failed_step: suspend
|
||||
suspend
|
||||
The success field counts successful suspend-to-RAM transitions, and the fail
|
||||
field counts failed ones. The other fields count failures in the individual
|
||||
steps of suspend to RAM. suspend_stats lists only the last 2 failed devices,
|
||||
error numbers and failed steps of suspend.
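
The counter lines in the example output follow a simple "name: value" layout, so
a user space tool could pick them apart with sscanf(). The following is only a
minimal sketch: the helper name parse_counter_line() and the 31-character key
limit are assumptions of this example, not part of any kernel interface.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one "name: value" counter line, e.g. "failed_suspend: 5".
 * Returns 1 if both a name and a numeric value were found, 0 otherwise
 * (for instance for the "failures:" header or a continuation line that
 * carries only a value). Hypothetical helper for illustration.
 */
static int parse_counter_line(const char *line, char key[32], long *value)
{
    assert(line && key && value);
    /* name (no colon or whitespace), optional spaces, ':', then a number */
    return sscanf(line, " %31[^: \t\n] : %ld", key, value) == 2;
}
```

Lines such as "last_failed_dev: alarm" are rejected by this helper because the
value is not numeric; a real tool would need separate handling for them.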
|
||||
|
|
|
@ -152,7 +152,9 @@ try to use its wakeup mechanism. device_set_wakeup_enable() affects this flag;
|
|||
for the most part drivers should not change its value. The initial value of
|
||||
should_wakeup is supposed to be false for the majority of devices; the major
|
||||
exceptions are power buttons, keyboards, and Ethernet adapters whose WoL
|
||||
(wake-on-LAN) feature has been set up with ethtool.
|
||||
(wake-on-LAN) feature has been set up with ethtool. It should also default
|
||||
to true for devices that don't generate wakeup requests on their own but merely
|
||||
forward wakeup requests from one bus to another (like PCI bridges).
|
||||
|
||||
Whether or not a device is capable of issuing wakeup events is a hardware
|
||||
matter, and the kernel is responsible for keeping track of it. By contrast,
|
||||
|
@ -279,10 +281,6 @@ When the system goes into the standby or memory sleep state, the phases are:
|
|||
time.) Unlike the other suspend-related phases, during the prepare
|
||||
phase the device tree is traversed top-down.
|
||||
|
||||
In addition to that, if device drivers need to allocate additional
|
||||
memory to be able to hadle device suspend correctly, that should be
|
||||
done in the prepare phase.
|
||||
|
||||
After the prepare callback method returns, no new children may be
|
||||
registered below the device. The method may also prepare the device or
|
||||
driver in some way for the upcoming system power transition (for
|
||||
|
|
|
@ -22,12 +22,12 @@ try_to_freeze_tasks() that sets TIF_FREEZE for all of the freezable tasks and
|
|||
either wakes them up, if they are kernel threads, or sends fake signals to them,
|
||||
if they are user space processes. A task that has TIF_FREEZE set, should react
|
||||
to it by calling the function called refrigerator() (defined in
|
||||
kernel/power/process.c), which sets the task's PF_FROZEN flag, changes its state
|
||||
kernel/freezer.c), which sets the task's PF_FROZEN flag, changes its state
|
||||
to TASK_UNINTERRUPTIBLE and makes it loop until PF_FROZEN is cleared for it.
|
||||
Then, we say that the task is 'frozen' and therefore the set of functions
|
||||
handling this mechanism is referred to as 'the freezer' (these functions are
|
||||
defined in kernel/power/process.c and include/linux/freezer.h). User space
|
||||
processes are generally frozen before kernel threads.
|
||||
defined in kernel/power/process.c, kernel/freezer.c & include/linux/freezer.h).
|
||||
User space processes are generally frozen before kernel threads.
|
||||
|
||||
It is not recommended to call refrigerator() directly. Instead, it is
|
||||
recommended to use the try_to_freeze() function (defined in
|
||||
|
@ -95,7 +95,7 @@ after the memory for the image has been freed, we don't want tasks to allocate
|
|||
additional memory and we prevent them from doing that by freezing them earlier.
|
||||
[Of course, this also means that device drivers should not allocate substantial
|
||||
amounts of memory from their .suspend() callbacks before hibernation, but this
|
||||
is e separate issue.]
|
||||
is a separate issue.]
|
||||
|
||||
3. The third reason is to prevent user space processes and some kernel threads
|
||||
from interfering with the suspending and resuming of devices. A user space
|
||||
|
|
|
@ -4,14 +4,19 @@ This interface provides a kernel and user mode interface for registering
|
|||
performance expectations by drivers, subsystems and user space applications on
|
||||
one of the parameters.
|
||||
|
||||
Currently we have {cpu_dma_latency, network_latency, network_throughput} as the
|
||||
initial set of pm_qos parameters.
|
||||
Two different PM QoS frameworks are available:
|
||||
1. PM QoS classes for cpu_dma_latency, network_latency, network_throughput.
|
||||
2. the per-device PM QoS framework provides the API to manage the per-device latency
|
||||
constraints.
|
||||
|
||||
Each parameter has defined units:
|
||||
* latency: usec
|
||||
* timeout: usec
|
||||
* throughput: kbs (kilo bit / sec)
|
||||
|
||||
|
||||
1. PM QoS framework
|
||||
|
||||
The infrastructure exposes multiple misc device nodes one per implemented
|
||||
parameter. The set of parameters implemented is defined by pm_qos_power_init()
|
||||
and pm_qos_params.h. This is done because having the available parameters
|
||||
|
@ -23,14 +28,18 @@ an aggregated target value. The aggregated target value is updated with
|
|||
changes to the request list or elements of the list. Typically the
|
||||
aggregated target value is simply the max or min of the request values held
|
||||
in the parameter list elements.
|
||||
Note: the aggregated target value is implemented as an atomic variable so that
|
||||
reading the aggregated value does not require any locking mechanism.
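
To make the aggregation concrete, here is a small user space model, not kernel
code. It assumes that latency-style classes aggregate with min and
throughput-style classes with max, and that a caller-supplied default applies
when no request exists; all names (agg_type, agg_compute, agg_publish,
agg_read) are hypothetical. The C11 atomic mirrors the note about lock-free
reads.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

enum agg_type { AGG_MIN, AGG_MAX };      /* hypothetical names */

/* Fold all request values into one target; 'def' is used for an empty list. */
static int32_t agg_compute(enum agg_type type, const int32_t *req, size_t n,
                           int32_t def)
{
    if (n == 0)
        return def;

    int32_t target = req[0];
    for (size_t i = 1; i < n; i++) {
        if (type == AGG_MIN ? req[i] < target : req[i] > target)
            target = req[i];
    }
    return target;
}

/* The published target is atomic, so readers need no lock. */
static _Atomic int32_t agg_target;

static void agg_publish(int32_t value)
{
    atomic_store(&agg_target, value);
}

static int32_t agg_read(void)
{
    return atomic_load(&agg_target);
}
```

Writers would still serialize changes to the request list itself; only the read
of the final value is lock-free in this model.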
|
||||
|
||||
|
||||
From kernel mode the use of this interface is simple:
|
||||
|
||||
handle = pm_qos_add_request(param_class, target_value):
|
||||
Will insert an element into the list for that identified PM_QOS class with the
|
||||
void pm_qos_add_request(handle, param_class, target_value):
|
||||
Will insert an element into the list for that identified PM QoS class with the
|
||||
target value. Upon change to this list the new target is recomputed and any
|
||||
registered notifiers are called only if the target value is now different.
|
||||
Clients of pm_qos need to save the returned handle.
|
||||
Clients of pm_qos need to save the returned handle for future use in other
|
||||
pm_qos API functions.
|
||||
|
||||
void pm_qos_update_request(handle, new_target_value):
|
||||
Will update the list element pointed to by the handle with the new target value
|
||||
|
@ -42,6 +51,20 @@ Will remove the element. After removal it will update the aggregate target and
|
|||
call the notification tree if the target was changed as a result of removing
|
||||
the request.
|
||||
|
||||
int pm_qos_request(param_class):
|
||||
Returns the aggregated value for a given PM QoS class.
|
||||
|
||||
int pm_qos_request_active(handle):
|
||||
Returns whether the request is still active, i.e. it has not been removed from a
|
||||
PM QoS class constraints list.
|
||||
|
||||
int pm_qos_add_notifier(param_class, notifier):
|
||||
Adds a notification callback function to the PM QoS class. The callback is
|
||||
called when the aggregated value for the PM QoS class is changed.
|
||||
|
||||
int pm_qos_remove_notifier(param_class, notifier):
|
||||
Removes the notification callback function for the PM QoS class.
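
The add/update/remove behaviour described above, with notifiers running only
when the aggregated target really changes, can be modelled in user space as
follows. This is a sketch under stated assumptions, not the kernel
implementation: it uses a fixed-size table, min aggregation, an integer slot
index as the handle, and an assumed default target of 2000; every name starting
with qosm_ or QOSM_ is hypothetical.

```c
#define QOSM_MAX_REQ  8
#define QOSM_DEFAULT  2000    /* assumed target when no request is active */

struct qosm_class {
    int value[QOSM_MAX_REQ];
    int active[QOSM_MAX_REQ];
    int target;               /* current aggregated target */
    int notify_count;         /* times the notifiers would have been called */
};

/* Recompute the min over active requests; "notify" only on a real change. */
static void qosm_recompute(struct qosm_class *c)
{
    int target = QOSM_DEFAULT;
    int found = 0;

    for (int i = 0; i < QOSM_MAX_REQ; i++) {
        if (!c->active[i])
            continue;
        if (!found || c->value[i] < target)
            target = c->value[i];
        found = 1;
    }
    if (target != c->target) {
        c->target = target;
        c->notify_count++;
    }
}

static void qosm_init(struct qosm_class *c)
{
    for (int i = 0; i < QOSM_MAX_REQ; i++)
        c->active[i] = 0;
    c->target = QOSM_DEFAULT;
    c->notify_count = 0;
}

/* Returns a handle (slot index), or -1 if the model's table is full. */
static int qosm_add_request(struct qosm_class *c, int value)
{
    for (int i = 0; i < QOSM_MAX_REQ; i++) {
        if (!c->active[i]) {
            c->active[i] = 1;
            c->value[i] = value;
            qosm_recompute(c);
            return i;
        }
    }
    return -1;
}

static void qosm_update_request(struct qosm_class *c, int handle, int value)
{
    c->value[handle] = value;
    qosm_recompute(c);
}

static void qosm_remove_request(struct qosm_class *c, int handle)
{
    c->active[handle] = 0;
    qosm_recompute(c);
}
```

Adding a second, looser request leaves the target and the notification count
untouched, which is the "only if the target value is now different" rule.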
|
||||
|
||||
|
||||
From user mode:
|
||||
Only processes can register a pm_qos request. To provide for automatic
|
||||
|
@ -63,4 +86,63 @@ To remove the user mode request for a target value simply close the device
|
|||
node.
|
||||
|
||||
|
||||
2. PM QoS per-device latency framework
|
||||
|
||||
For each device a list of performance requests is maintained along with
|
||||
an aggregated target value. The aggregated target value is updated with
|
||||
changes to the request list or elements of the list. Typically the
|
||||
aggregated target value is simply the max or min of the request values held
|
||||
in the parameter list elements.
|
||||
Note: the aggregated target value is implemented as an atomic variable so that
|
||||
reading the aggregated value does not require any locking mechanism.
|
||||
|
||||
|
||||
From kernel mode the use of this interface is the following:
|
||||
|
||||
int dev_pm_qos_add_request(device, handle, value):
|
||||
Will insert an element into the list for that identified device with the
|
||||
target value. Upon change to this list the new target is recomputed and any
|
||||
registered notifiers are called only if the target value is now different.
|
||||
Clients of dev_pm_qos need to save the handle for future use in other
|
||||
dev_pm_qos API functions.
|
||||
|
||||
int dev_pm_qos_update_request(handle, new_value):
|
||||
Will update the list element pointed to by the handle with the new target value
|
||||
and recompute the new aggregated target, calling the notification trees if the
|
||||
target is changed.
|
||||
|
||||
int dev_pm_qos_remove_request(handle):
|
||||
Will remove the element. After removal it will update the aggregate target and
|
||||
call the notification trees if the target was changed as a result of removing
|
||||
the request.
|
||||
|
||||
s32 dev_pm_qos_read_value(device):
|
||||
Returns the aggregated value for a given device's constraints list.
|
||||
|
||||
|
||||
Notification mechanisms:
|
||||
The per-device PM QoS framework has 2 different and distinct notification trees:
|
||||
a per-device notification tree and a global notification tree.
|
||||
|
||||
int dev_pm_qos_add_notifier(device, notifier):
|
||||
Adds a notification callback function for the device.
|
||||
The callback is called when the aggregated value of the device constraints list
|
||||
is changed.
|
||||
|
||||
int dev_pm_qos_remove_notifier(device, notifier):
|
||||
Removes the notification callback function for the device.
|
||||
|
||||
int dev_pm_qos_add_global_notifier(notifier):
|
||||
Adds a notification callback function in the global notification tree of the
|
||||
framework.
|
||||
The callback is called when the aggregated value for any device is changed.
|
||||
|
||||
int dev_pm_qos_remove_global_notifier(notifier):
|
||||
Removes the notification callback function from the global notification tree
|
||||
of the framework.
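
The two notification trees can be pictured with the following user space model.
It is a sketch only: each "tree" is reduced to a single callback slot, the
target is set directly instead of being aggregated from a request list, and all
names (devqos_dev, devqos_set_target and so on) are hypothetical.

```c
#include <stddef.h>

typedef void (*devqos_notifier_fn)(int dev_id, int new_value);

struct devqos_dev {
    int id;
    int target;                    /* aggregated value for this device */
    devqos_notifier_fn notifier;   /* per-device tree (one slot here) */
};

/* Global tree (one slot here): called for a change on any device. */
static devqos_notifier_fn devqos_global_notifier;

static void devqos_set_target(struct devqos_dev *dev, int value)
{
    if (value == dev->target)
        return;                    /* unchanged: nobody is notified */
    dev->target = value;
    if (dev->notifier)
        dev->notifier(dev->id, value);
    if (devqos_global_notifier)
        devqos_global_notifier(dev->id, value);
}

/* Example callbacks that merely count invocations. */
static int devqos_dev_calls;
static int devqos_global_calls;

static void devqos_count_dev(int dev_id, int new_value)
{
    (void)dev_id;
    (void)new_value;
    devqos_dev_calls++;
}

static void devqos_count_global(int dev_id, int new_value)
{
    (void)dev_id;
    (void)new_value;
    devqos_global_calls++;
}
```

A device without a per-device notifier still triggers the global callback when
its value changes, which is the point of having the two trees separate.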
|
||||
|
||||
|
||||
From user mode:
|
||||
No API for user space access to the per-device latency constraints is provided
|
||||
yet - still under discussion.
|
||||
|
||||
|
|
|
@ -16,7 +16,7 @@ initialisation code by creating a struct regulator_consumer_supply for
|
|||
each regulator.
|
||||
|
||||
struct regulator_consumer_supply {
|
||||
struct device *dev; /* consumer */
|
||||
const char *dev_name; /* consumer dev_name() */
|
||||
const char *supply; /* consumer supply - e.g. "vcc" */
|
||||
};
|
||||
|
||||
|
@ -24,13 +24,13 @@ e.g. for the machine above
|
|||
|
||||
static struct regulator_consumer_supply regulator1_consumers[] = {
|
||||
{
|
||||
.dev = &platform_consumerB_device.dev,
|
||||
.supply = "Vcc",
|
||||
.dev_name = "dev_name(consumer B)",
|
||||
.supply = "Vcc",
|
||||
},};
|
||||
|
||||
static struct regulator_consumer_supply regulator2_consumers[] = {
|
||||
{
|
||||
.dev = &platform_consumerA_device.dev,
|
||||
.dev_name = "dev_name(consumer A)",
|
||||
.supply = "Vcc",
|
||||
},};
|
||||
|
||||
|
@ -43,6 +43,7 @@ to their supply regulator :-
|
|||
|
||||
static struct regulator_init_data regulator1_data = {
|
||||
.constraints = {
|
||||
.name = "Regulator-1",
|
||||
.min_uV = 3300000,
|
||||
.max_uV = 3300000,
|
||||
.valid_modes_mask = REGULATOR_MODE_NORMAL,
|
||||
|
@ -51,13 +52,19 @@ static struct regulator_init_data regulator1_data = {
|
|||
.consumer_supplies = regulator1_consumers,
|
||||
};
|
||||
|
||||
The name field should be set to something that is usefully descriptive
|
||||
for the board for configuration of supplies for other regulators and
|
||||
for use in logging and other diagnostic output. Normally the name
|
||||
used for the supply rail in the schematic is a good choice. If no
|
||||
name is provided then the subsystem will choose one.
|
||||
|
||||
Regulator-1 supplies power to Regulator-2. This relationship must be registered
|
||||
with the core so that Regulator-1 is also enabled when Consumer A enables its
|
||||
supply (Regulator-2). The supply regulator is set by the supply_regulator
|
||||
field below:-
|
||||
field below and co:-
|
||||
|
||||
static struct regulator_init_data regulator2_data = {
|
||||
.supply_regulator = "regulator_name",
|
||||
.supply_regulator = "Regulator-1",
|
||||
.constraints = {
|
||||
.min_uV = 1800000,
|
||||
.max_uV = 2000000,
|
||||
|
|
|
@ -43,13 +43,18 @@ struct dev_pm_ops {
|
|||
...
|
||||
};
|
||||
|
||||
The ->runtime_suspend(), ->runtime_resume() and ->runtime_idle() callbacks are
|
||||
executed by the PM core for either the device type, or the class (if the device
|
||||
type's struct dev_pm_ops object does not exist), or the bus type (if the
|
||||
device type's and class' struct dev_pm_ops objects do not exist) of the given
|
||||
device (this allows device types to override callbacks provided by bus types or
|
||||
classes if necessary). The bus type, device type and class callbacks are
|
||||
referred to as subsystem-level callbacks in what follows.
|
||||
The ->runtime_suspend(), ->runtime_resume() and ->runtime_idle() callbacks
|
||||
are executed by the PM core for either the power domain, or the device type
|
||||
(if the device power domain's struct dev_pm_ops does not exist), or the class
|
||||
(if the device power domain's and type's struct dev_pm_ops object does not
|
||||
exist), or the bus type (if the device power domain's, type's and class'
|
||||
struct dev_pm_ops objects do not exist) of the given device. The priority
|
||||
order of callbacks, from high to low, is: power domain callbacks, device
|
||||
type callbacks, class callbacks, bus type callbacks; a higher-priority callback
|
||||
overrides a lower-priority one. The bus type, device type and
|
||||
class callbacks are referred to as subsystem-level callbacks in what follows,
|
||||
and generally speaking, the power domain callbacks are used for representing
|
||||
power domains within a SoC.
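
The selection order can be sketched as a chain of NULL checks. The structures
below are a user space model with hypothetical names (prec_pm_ops, prec_device,
prec_pick_ops); they only mirror the rule stated above, namely that the first
existing struct dev_pm_ops object in the order power domain, device type,
class, bus type is the one whose callbacks are used.

```c
#include <stddef.h>

/* Stand-in for struct dev_pm_ops, reduced to one callback. */
struct prec_pm_ops {
    int (*runtime_suspend)(void);
};

/* Stand-in for a device with up to four sources of callbacks. */
struct prec_device {
    const struct prec_pm_ops *pm_domain;
    const struct prec_pm_ops *type;
    const struct prec_pm_ops *class_;
    const struct prec_pm_ops *bus;
};

/* Return the highest-priority ops object that exists, or NULL if none does. */
static const struct prec_pm_ops *prec_pick_ops(const struct prec_device *dev)
{
    if (dev->pm_domain)
        return dev->pm_domain;
    if (dev->type)
        return dev->type;
    if (dev->class_)
        return dev->class_;
    return dev->bus;
}
```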
|
||||
|
||||
By default, the callbacks are always invoked in process context with interrupts
|
||||
enabled. However, subsystems can use the pm_runtime_irq_safe() helper function
|
||||
|
@ -477,12 +482,14 @@ pm_runtime_autosuspend_expiration()
|
|||
If pm_runtime_irq_safe() has been called for a device then the following helper
|
||||
functions may also be used in interrupt context:
|
||||
|
||||
pm_runtime_idle()
|
||||
pm_runtime_suspend()
|
||||
pm_runtime_autosuspend()
|
||||
pm_runtime_resume()
|
||||
pm_runtime_get_sync()
|
||||
pm_runtime_put_sync()
|
||||
pm_runtime_put_sync_suspend()
|
||||
pm_runtime_put_sync_autosuspend()
|
||||
|
||||
5. Runtime PM Initialization, Device Probing and Removal
|
||||
|
||||
|
@ -782,6 +789,16 @@ will behave normally, not taking the autosuspend delay into account.
|
|||
Similarly, if the power.use_autosuspend field isn't set then the autosuspend
|
||||
helper functions will behave just like the non-autosuspend counterparts.
|
||||
|
||||
Under some circumstances a driver or subsystem may want to prevent a device
|
||||
from autosuspending immediately, even though the usage counter is zero and the
|
||||
autosuspend delay time has expired. If the ->runtime_suspend() callback
|
||||
returns -EAGAIN or -EBUSY, and if the next autosuspend delay expiration time is
|
||||
in the future (as it normally would be if the callback invoked
|
||||
pm_runtime_mark_last_busy()), the PM core will automatically reschedule the
|
||||
autosuspend. The ->runtime_suspend() callback can't do this rescheduling
|
||||
itself because no suspend requests of any kind are accepted while the device is
|
||||
suspending (i.e., while the callback is running).
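
The rescheduling rule boils down to two conditions, which the following user
space sketch spells out. The function name and the plain integer timestamps are
assumptions of this example; timer wraparound is deliberately ignored.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Decide whether an autosuspend should be rescheduled after the
 * ->runtime_suspend() callback returned 'ret'. 'now' and 'expires' are
 * timestamps in the same arbitrary unit; 'expires' is the next autosuspend
 * delay expiration time. Hypothetical helper, wraparound not handled.
 */
static bool autosuspend_should_reschedule(int ret, unsigned long now,
                                          unsigned long expires)
{
    bool busy = (ret == -EAGAIN || ret == -EBUSY);

    return busy && expires > now;
}
```

If the callback did not refresh the last-busy time, the expiration would
typically not lie in the future and no rescheduling would take place.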
|
||||
|
||||
The implementation is well suited for asynchronous use in interrupt contexts.
|
||||
However such use inevitably involves races, because the PM core can't
|
||||
synchronize ->runtime_suspend() callbacks with the arrival of I/O requests.
|
||||
|
|
275
Documentation/power/suspend-and-cpuhotplug.txt
Normal file
|
@ -0,0 +1,275 @@
|
|||
Interaction of Suspend code (S3) with the CPU hotplug infrastructure
|
||||
|
||||
(C) 2011 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
|
||||
|
||||
|
||||
I. How does the regular CPU hotplug code differ from how the Suspend-to-RAM
|
||||
infrastructure uses it internally? And where do they share common code?
|
||||
|
||||
Well, a picture is worth a thousand words... So ASCII art follows :-)
|
||||
|
||||
[This depicts the current design in the kernel, and focusses only on the
|
||||
interactions involving the freezer and CPU hotplug and also tries to explain
|
||||
the locking involved. It outlines the notifications involved as well.
|
||||
But please note that here, only the call paths are illustrated, with the aim
|
||||
of describing where they take different paths and where they share code.
|
||||
What happens when regular CPU hotplug and Suspend-to-RAM race with each other
|
||||
is not depicted here.]
|
||||
|
||||
On a high level, the suspend-resume cycle goes like this:
|
||||
|
||||
|Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw |
|tasks |    |     cpus      |    |          |    |     cpus     |    |tasks|
|
||||
|
||||
|
||||
More details follow:
|
||||
|
||||
Suspend call path
|
||||
-----------------
|
||||
|
||||
Write 'mem' to
|
||||
/sys/power/state
|
||||
sysfs file
|
||||
|
|
||||
v
|
||||
Acquire pm_mutex lock
|
||||
|
|
||||
v
|
||||
Send PM_SUSPEND_PREPARE
|
||||
notifications
|
||||
|
|
||||
v
|
||||
Freeze tasks
|
||||
|
|
||||
|
|
||||
v
|
||||
disable_nonboot_cpus()
|
||||
/* start */
|
||||
|
|
||||
v
|
||||
Acquire cpu_add_remove_lock
|
||||
|
|
||||
v
|
||||
Iterate over CURRENTLY
|
||||
online CPUs
|
||||
|
|
||||
|
|
||||
                          |                            ----------
                          v                           |    L
         ======>      _cpu_down()                     |
        |       [This takes cpuhotplug.lock           |
 Common |        before taking down the CPU           |
  code  |        and releases it when done]           |    O
        |     While it is at it, notifications        |
        |     are sent when notable events occur,     |
         ======> by running all registered callbacks. |
                          |                           |    O
                          |                           |
                          |                           |
                          v                           |
               Note down these cpus in                |    P
                   frozen_cpus mask            ----------
|
||||
|
|
||||
v
|
||||
Disable regular cpu hotplug
|
||||
by setting cpu_hotplug_disabled=1
|
||||
|
|
||||
v
|
||||
Release cpu_add_remove_lock
|
||||
|
|
||||
v
|
||||
/* disable_nonboot_cpus() complete */
|
||||
|
|
||||
v
|
||||
Do suspend
|
||||
|
||||
|
||||
|
||||
Resume proceeds likewise, with the counterparts being (in the order of
|
||||
execution during resume):
|
||||
* enable_nonboot_cpus() which involves:
|
||||
| Acquire cpu_add_remove_lock
|
||||
| Reset cpu_hotplug_disabled to 0, thereby enabling regular cpu hotplug
|
||||
| Call _cpu_up() [for all those cpus in the frozen_cpus mask, in a loop]
|
||||
| Release cpu_add_remove_lock
|
||||
v
|
||||
|
||||
* thaw tasks
|
||||
* send PM_POST_SUSPEND notifications
|
||||
* Release pm_mutex lock.
|
||||
|
||||
|
||||
It is to be noted here that the pm_mutex lock is acquired at the very
|
||||
beginning, when we are just starting out to suspend, and then released only
|
||||
after the entire cycle is complete (i.e., suspend + resume).
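
The bookkeeping done by disable_nonboot_cpus() and enable_nonboot_cpus() can be
modelled in user space with plain bitmasks. This is only a sketch: locking and
notifications are left out, CPU 0 is assumed to be the boot CPU, at most 32
CPUs are modelled, and returning -EBUSY from the regular hotplug path while
hotplug is disabled is an assumption of this example. All sm_ names are
hypothetical.

```c
#include <errno.h>
#include <stdint.h>

#define SM_BOOT_CPU 0              /* assumption: CPU 0 is the boot CPU */

struct sm_cpus {
    uint32_t online;               /* bit n set: CPU n is online */
    uint32_t frozen_cpus;          /* CPUs taken down for suspend */
    int hotplug_disabled;          /* models cpu_hotplug_disabled */
};

/* Take down every currently online non-boot CPU and remember which ones. */
static void sm_disable_nonboot_cpus(struct sm_cpus *m)
{
    m->frozen_cpus = 0;
    for (int cpu = 0; cpu < 32; cpu++) {
        uint32_t bit = UINT32_C(1) << cpu;

        if (cpu == SM_BOOT_CPU || !(m->online & bit))
            continue;
        m->online &= ~bit;         /* stands in for _cpu_down() */
        m->frozen_cpus |= bit;
    }
    m->hotplug_disabled = 1;       /* regular hotplug is refused from now on */
}

/* Bring back exactly the CPUs recorded in frozen_cpus. */
static void sm_enable_nonboot_cpus(struct sm_cpus *m)
{
    m->hotplug_disabled = 0;
    m->online |= m->frozen_cpus;   /* stands in for _cpu_up() in a loop */
    m->frozen_cpus = 0;
}

/* Regular hotplug request arriving from user space. */
static int sm_cpu_down(struct sm_cpus *m, int cpu)
{
    if (m->hotplug_disabled)
        return -EBUSY;             /* assumed error code for the refusal */
    m->online &= ~(UINT32_C(1) << cpu);
    return 0;
}
```

A CPU that was already offline before suspend is never added to frozen_cpus, so
it stays offline after resume.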
|
||||
|
||||
|
||||
|
||||
Regular CPU hotplug call path
|
||||
-----------------------------
|
||||
|
||||
Write 0 (or 1) to
|
||||
/sys/devices/system/cpu/cpu*/online
|
||||
sysfs file
|
||||
|
|
||||
|
|
||||
v
|
||||
cpu_down()
|
||||
|
|
||||
v
|
||||
Acquire cpu_add_remove_lock
|
||||
|
|
||||
v
|
||||
If cpu_hotplug_disabled is 1
|
||||
return gracefully
|
||||
|
|
||||
|
|
||||
v
|
||||
======> _cpu_down()
|
||||
| [This takes cpuhotplug.lock
|
||||
Common | before taking down the CPU
|
||||
code | and releases it when done]
|
||||
| While it is at it, notifications
|
||||
| are sent when notable events occur,
|
||||
======> by running all registered callbacks.
|
||||
|
|
||||
|
|
||||
v
|
||||
Release cpu_add_remove_lock
|
||||
[That's it!, for
|
||||
regular CPU hotplug]
|
||||
|
||||
|
||||
|
||||
So, as can be seen from the two diagrams (the parts marked as "Common code"),
|
||||
regular CPU hotplug and the suspend code path converge at the _cpu_down() and
|
||||
_cpu_up() functions. They differ in the arguments passed to these functions,
|
||||
in that during regular CPU hotplug, 0 is passed for the 'tasks_frozen'
|
||||
argument. But during suspend, since the tasks are already frozen by the time
|
||||
the non-boot CPUs are offlined or onlined, the _cpu_*() functions are called
|
||||
with the 'tasks_frozen' argument set to 1.
|
||||
[See below for some known issues regarding this.]
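
One way to picture the effect of the 'tasks_frozen' argument is a flag bit that
turns an event into its "_FROZEN" variant. The sketch below assumes such an
encoding; the constant names and numeric values are illustrative assumptions of
this example and should not be read as the kernel's definitions.

```c
/* Illustrative values only; the flag-bit encoding is an assumption. */
#define TF_CPU_ONLINE        0x0002UL
#define TF_CPU_DEAD          0x0007UL
#define TF_CPU_TASKS_FROZEN  0x0010UL

/*
 * Map a base event and the 'tasks_frozen' argument to the notification that
 * would be sent: the plain event for regular hotplug (tasks_frozen == 0), the
 * "_FROZEN" variant during suspend (tasks_frozen == 1).
 */
static unsigned long tf_notification(unsigned long event, int tasks_frozen)
{
    return tasks_frozen ? (event | TF_CPU_TASKS_FROZEN) : event;
}
```

With this model, a callback can tell the two paths apart by testing the flag
bit, or handle both alike by masking it off.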
|
||||
|
||||
|
||||
Important files and functions/entry points:
|
||||
------------------------------------------
|
||||
|
||||
kernel/power/process.c : freeze_processes(), thaw_processes()
|
||||
kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish()
|
||||
kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](), [disable|enable]_nonboot_cpus()
|
||||
|
||||
|
||||
|
||||
II. What are the issues involved in CPU hotplug?
|
||||
-------------------------------------------
|
||||
|
||||
There are some interesting situations involving CPU hotplug and microcode
|
||||
update on the CPUs, as discussed below:
|
||||
|
||||
[Please bear in mind that the kernel requests the microcode images from
|
||||
userspace, using the request_firmware() function defined in
|
||||
drivers/base/firmware_class.c]
|
||||
|
||||
|
||||
a. When all the CPUs are identical:
|
||||
|
||||
This is the most common situation and it is quite straightforward: we want
|
||||
to apply the same microcode revision to each of the CPUs.
|
||||
To give an example of x86, the collect_cpu_info() function defined in
|
||||
arch/x86/kernel/microcode_core.c helps in discovering the type of the CPU
|
||||
and thereby in applying the correct microcode revision to it.
|
||||
But note that the kernel does not maintain a common microcode image for
|
||||
all the CPUs, in order to handle case 'b' described below.
|
||||
|
||||
|
||||
b. When some of the CPUs are different than the rest:
|
||||
|
||||
In this case since we probably need to apply different microcode revisions
|
||||
to different CPUs, the kernel maintains a copy of the correct microcode
|
||||
image for each CPU (after appropriate CPU type/model discovery using
|
||||
functions such as collect_cpu_info()).
|
||||
|
||||
|
||||
c. When a CPU is physically hot-unplugged and a new (and possibly different
|
||||
type of) CPU is hot-plugged into the system:
|
||||
|
||||
In the current design of the kernel, whenever a CPU is taken offline during
|
||||
a regular CPU hotplug operation, upon receiving the CPU_DEAD notification
|
||||
(which is sent by the CPU hotplug code), the microcode update driver's
|
||||
callback for that event reacts by freeing the kernel's copy of the
|
||||
microcode image for that CPU.
|
||||
|
||||
Hence, when a new CPU is brought online, since the kernel finds that it
|
||||
doesn't have the microcode image, it does the CPU type/model discovery
|
||||
afresh and then requests the appropriate microcode image for that CPU
|
||||
from userspace; the image is subsequently applied.
|
||||
|
||||
For example, in x86, the mc_cpu_callback() function (which is the microcode
|
||||
update driver's callback registered for CPU hotplug events) calls
|
||||
microcode_update_cpu() which would call microcode_init_cpu() in this case,
|
||||
instead of microcode_resume_cpu() when it finds that the kernel doesn't
|
||||
have a valid microcode image. This ensures that the CPU type/model
|
||||
discovery is performed and the right microcode is applied to the CPU after
|
||||
getting it from userspace.
|
||||
|
||||
|
||||
d. Handling microcode update during suspend/hibernate:
|
||||
|
||||
Strictly speaking, during a CPU hotplug operation which does not involve
|
||||
physically removing or inserting CPUs, the CPUs are not actually powered
|
||||
off during a CPU offline. They are just put to the lowest C-states possible.
|
||||
Hence, in such a case, it is not really necessary to re-apply microcode
|
||||
when the CPUs are brought back online, since they wouldn't have lost the
|
||||
image during the CPU offline operation.
|
||||
|
||||
This is the usual scenario encountered during a resume after a suspend.
|
||||
However, in the case of hibernation, since all the CPUs are completely
|
||||
powered off, during restore it becomes necessary to apply the microcode
|
||||
images to all the CPUs.
|
||||
|
||||
[Note that we don't expect someone to physically pull out nodes and insert
|
||||
nodes with a different type of CPUs in-between a suspend-resume or a
|
||||
hibernate/restore cycle.]
|
||||
|
||||
In the current design of the kernel however, during a CPU offline operation
|
||||
as part of the suspend/hibernate cycle (the CPU_DEAD_FROZEN notification),
|
||||
the existing copy of microcode image in the kernel is not freed up.
|
||||
And during the CPU online operations (during resume/restore), since the
|
||||
kernel finds that it already has copies of the microcode images for all the
|
||||
CPUs, it just applies them to the CPUs, avoiding any re-discovery of CPU
|
||||
type/model and the need for validating whether the microcode revisions are
|
||||
right for the CPUs or not (due to the above assumption that physical CPU
|
||||
hotplug will not be done in-between suspend/resume or hibernate/restore
|
||||
cycles).
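
The difference between cases 'c' and 'd' is whether the kernel's copy of the
microcode image survives the CPU offline operation. The user space model below
captures just that decision; it is a sketch with hypothetical names (ucm_*), a
fixed CPU count chosen for the example, and a boolean standing in for the
cached image.

```c
#include <stdbool.h>

#define UCM_NR_CPUS 4                    /* arbitrary size for the model */

static bool ucm_image_cached[UCM_NR_CPUS];

/*
 * A CPU has gone offline. Regular hotplug (CPU_DEAD) frees the cached image;
 * the suspend/hibernate path (CPU_DEAD_FROZEN) keeps it.
 */
static void ucm_cpu_dead(int cpu, bool tasks_frozen)
{
    if (!tasks_frozen)
        ucm_image_cached[cpu] = false;
}

/*
 * A CPU comes online. Returns true if the image had to be rediscovered and
 * requested from userspace, false if the cached copy was simply re-applied.
 */
static bool ucm_cpu_online(int cpu)
{
    if (ucm_image_cached[cpu])
        return false;                    /* re-apply the existing copy */
    ucm_image_cached[cpu] = true;        /* discovery + request + apply */
    return true;
}
```

The second branch is the one that needs a working userspace, which matters for
the race described in section III.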
|
||||
|
||||
|
||||
III. Are there any known problems when regular CPU hotplug and suspend race
|
||||
with each other?
|
||||
|
||||
Yes, they are listed below:
|
||||
|
||||
1. When invoking regular CPU hotplug, the 'tasks_frozen' argument passed to
|
||||
the _cpu_down() and _cpu_up() functions is *always* 0.
|
||||
This might not reflect the true current state of the system, since the
|
||||
tasks could have been frozen by an out-of-band event such as a suspend
|
||||
operation in progress. Hence, it will lead to wrong notifications being
|
||||
sent during the cpu online/offline events (eg, CPU_ONLINE notification
|
||||
instead of CPU_ONLINE_FROZEN) which in turn will lead to execution of
|
||||
inappropriate code by the callbacks registered for such CPU hotplug events.
|
||||
|
||||
2. If a regular CPU hotplug stress test happens to race with the freezer due
|
||||
to a suspend operation in progress at the same time, then we could hit the
|
||||
situation described below:
|
||||
|
||||
* A regular cpu online operation continues its journey from userspace
|
||||
into the kernel, since the freezing has not yet begun.
|
||||
* Then the freezer gets to work and freezes userspace.
|
||||
* If cpu online has not yet completed the microcode update stuff by now,
|
||||
it will now start waiting on the frozen userspace in the
|
||||
TASK_UNINTERRUPTIBLE state, in order to get the microcode image.
|
||||
* Now the freezer continues and tries to freeze the remaining tasks. But
|
||||
due to this wait mentioned above, the freezer won't be able to freeze
|
||||
the cpu online hotplug task and hence freezing of tasks fails.
|
||||
|
||||
As a result of this task freezing failure, the suspend operation gets
|
||||
aborted.
|
|
@ -77,7 +77,8 @@ SNAPSHOT_SET_SWAP_AREA - set the resume partition and the offset (in <PAGE_SIZE>
|
|||
resume_swap_area, as defined in kernel/power/suspend_ioctls.h,
|
||||
containing the resume device specification and the offset); for swap
|
||||
partitions the offset is always 0, but it is different from zero for
|
||||
swap files (see Documentation/swsusp-and-swap-files.txt for details).
|
||||
swap files (see Documentation/power/swsusp-and-swap-files.txt for
|
||||
details).
|
||||
|
||||
SNAPSHOT_PLATFORM_SUPPORT - enable/disable the hibernation platform support,
|
||||
depending on the argument value (enable, if the argument is nonzero)
|
||||
|
|