Version 20120215.
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Simplifies the code, especially the compile-time
ACPI_REDUCED_HARDWARE option.
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
The functions for the original/legacy sleep/wake registers are in
hwsleep.c, and the functions for the new extended FADT V5 sleep
registers are in hwesleep.c
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Adds acpi_hw_execute_sleep_method to handle the various sleep methods
such as _GTS, _BFS, _WAK, and _SST. Removes the specialized
functions previously used for each of these methods.
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
This interface allows the host to override a table via a
physical address, instead of the logical address required by
acpi_os_table_override. This simplifies the host implementation.
Initial implementation by Thomas Renninger. ACPICA implementation
creates a single function for table overrides that attempts both
a logical and a physical override.
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Add new notify values, add support for "hardware specific" notifies.
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
This change expands acpi_os_read_memory and acpi_os_write_memory to a
full 64 bits. This allows 64 bit transfers via the acpi_read and
acpi_write interfaces. Note: The internal acpi_hw_read and acpi_hw_write
interfaces remain at 32 bits, because 64 bits is not needed to
access the standard ACPI registers.
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Add ACPI_REDUCED_HARDWARE flag that removes all hardware-related
code (about 10% code, 5% static data).
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
These prototypes were incorrectly declared in achware.h.
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Adds sleep and wake support for systems with these registers.
One new file, hwxfsleep.c
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
_REV returns the supported ACPI revision level, which is now 5.
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
The recent cpuidle consolidation changes erroneously omitted one
critical line of code.
Signed-off-by: Robert Lee <rob.lee@linaro.org>
Tested-by: Sekhar Nori <nsekhar@ti.com>
Acked-by: Sekhar Nori <nsekhar@ti.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Basically without this patch changing the mode of thermal zone
is not possible as wrong string size is passed to strncmp.
Signed-off-by: Amit Daniel Kachhap <amit.kachhap@linaro.org>
Acked-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Len Brown <len.brown@intel.com>
thermal_zone_device_register() never returns NULL, on error it returns and
ERR_PTR().
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Viresh Kumar <viresh.kumar@st.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@st.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
readl/writel versions for ARM contain memory barrier instruction for
synchronizing DMA buffers. These are not required at least on this
module. So use lighter _relaxed variants.
Signed-off-by: Viresh Kumar <viresh.kumar@st.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
ST's SPEAr13xx machines are based on CortexA9 ARM processors. These
machines contain a thermal sensor for junction temperature monitoring.
This patch adds support for this thermal sensor in existing thermal
framework.
[akpm@linux-foundation.org: little code cleanup]
[akpm@linux-foundation.org: print the pointer correctly]
[viresh.kumar@st.com: thermal/spear_thermal: add compilation dependency on PLAT_SPEAR]
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@st.com>
Signed-off-by: Viresh Kumar <viresh.kumar@st.com>
Signed-off-by: Viresh Kumar <viresh.kumar@st.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
Use the current logging style.
Remove PREFIX, add pr_fmt, convert the printks. All dmesg output now
prefixed with "thermal_sys: ".
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Jesper Juhl <jj@chaosbits.net>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
Just a few tidies to make it more like most kernel sources.
A couple of long lines still remain.
Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Jesper Juhl <jj@chaosbits.net>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
These don't add any value as they are used only once and the surrounding
code uses similar variable.
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Jesper Juhl <jj@chaosbits.net>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
Line continations are not necessary in function calls or statements.
Remove them.
Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Jesper Juhl <jj@chaosbits.net>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
With CONFIG_NET=n:
drivers/thermal/thermal_sys.c:63: warning: 'thermal_event_seqnum' defined but not used
Move 'thermal_event_seqnum' definition inside the '#ifdef CONFIG_NET'
[akpm@linux-foundation.org: make thermal_event_seqnum local to generate_netlink_event()]
Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
Acked-by: Guenter Roeck <guenter.roeck@ericsson.com>
Acked-by: Durgadoss R <durgadoss.r@intel.com>
Cc: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
Remove the 'sb_mark_dirty()', 'sb_mark_clean()' and 'sb_is_dirty()'
helpers which are not used. I introduced them 2 years and the
intention was to make all file-systems use them in order to be able to
optimize 'sync_supers()'. However, Al Viro vetoed my patches at the
end and asked me to push superblock management down to file-systems
and get rid of the 's_dirt' flag completely, as well as kill
'sync_supers()' altogether. Thus, remove the helpers.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Export 'dirty_writeback_interval' to make it visible to
file-systems. We are going to push superblock management down to
file-systems and get rid of the 'sync_supers' kernel thread completly.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Clean-up ext4 a tiny bit by removing useless s_dirt assignment in
'ext4_fill_super()' because a bit later we anyway call
'ext4_setup_super()' which writes the superblock to the media
unconditionally.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In some rather rare cases it is possible that ext4 may the superblock
to the media twice. This patch makes sure this does not happen. This
should speed up unmounting in those cases.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Commit a0375156ca cleaned up superblock
dirtying handling, but missed one place. This patch does what was
intended: if we have the journal, then we update the superblock
through the journal rather than doing this directly.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_punch_hole returns -ENOTSUPP but it should be using -EOPNOTSUPP
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When L2TP is configured as a module, requests for L2TP sockets do not result
in the l2tp_ppp module being loaded. Fix this by adding the appropriate
MODULE_ALIAS to be recognized by pppox's request_module() call.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The recently added parity error handling used an error code that was
already defined for a different error. This could lead to bnx2x
firmware assert. We need to fix this with new error codes that are
defined for parity error only.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Reviewed-by: Eddie Wai <eddie.wai@broadcom.com>
Reviewed-by: Bhanu Prakash Gollapudi <bprakash@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The RSS feature in tg3 hardware has only one rx producer ring for all
RSS rings. NAPI vector 1 is special and handles the refilling of the
rx producer ring on behalf of all RSS rings. There is a race condition
between these RSS NAPIs and the NAPI[1]. If NAPI[1] finishes checking
for refill and then another RSS ring empties the rx producer ring
before NAPI[1] exits NAPI, the chip will be completely out of SKBs in
the rx producer ring.
We fix this by adding a flag tp->rx_refill and rely on napi_schedule()/
napi_complete() to help synchronize it to close the race condition.
Update driver version to 3.123.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull powerpc merge from Benjamin Herrenschmidt:
"Here's the powerpc batch for this merge window. It is going to be a
bit more nasty than usual as in touching things outside of
arch/powerpc mostly due to the big iSeriesectomy :-) We finally got
rid of the bugger (legacy iSeries support) which was a PITA to
maintain and that nobody really used anymore.
Here are some of the highlights:
- Legacy iSeries is gone. Thanks Stephen ! There's still some bits
and pieces remaining if you do a grep -ir series arch/powerpc but
they are harmless and will be removed in the next few weeks
hopefully.
- The 'fadump' functionality (Firmware Assisted Dump) replaces the
previous (equivalent) "pHyp assisted dump"... it's a rewrite of a
mechanism to get the hypervisor to do crash dumps on pSeries, the
new implementation hopefully being much more reliable. Thanks
Mahesh Salgaonkar.
- The "EEH" code (pSeries PCI error handling & recovery) got a big
spring cleaning, motivated by the need to be able to implement a
new backend for it on top of some new different type of firwmare.
The work isn't complete yet, but a good chunk of the cleanups is
there. Note that this adds a field to struct device_node which is
not very nice and which Grant objects to. I will have a patch soon
that moves that to a powerpc private data structure (hopefully
before rc1) and we'll improve things further later on (hopefully
getting rid of the need for that pointer completely). Thanks Gavin
Shan.
- I dug into our exception & interrupt handling code to improve the
way we do lazy interrupt handling (and make it work properly with
"edge" triggered interrupt sources), and while at it found & fixed
a wagon of issues in those areas, including adding support for page
fault retry & fatal signals on page faults.
- Your usual random batch of small fixes & updates, including a bunch
of new embedded boards, both Freescale and APM based ones, etc..."
I fixed up some conflicts with the generalized irq-domain changes from
Grant Likely, hopefully correctly.
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (141 commits)
powerpc/ps3: Do not adjust the wrapper load address
powerpc: Remove the rest of the legacy iSeries include files
powerpc: Remove the remaining CONFIG_PPC_ISERIES pieces
init: Remove CONFIG_PPC_ISERIES
powerpc: Remove FW_FEATURE ISERIES from arch code
tty/hvc_vio: FW_FEATURE_ISERIES is no longer selectable
powerpc/spufs: Fix double unlocks
powerpc/5200: convert mpc5200 to use of_platform_populate()
powerpc/mpc5200: add options to mpc5200_defconfig
powerpc/mpc52xx: add a4m072 board support
powerpc/mpc5200: update mpc5200_defconfig to fit for charon board
Documentation/powerpc/mpc52xx.txt: Checkpatch cleanup
powerpc/44x: Add additional device support for APM821xx SoC and Bluestone board
powerpc/44x: Add support PCI-E for APM821xx SoC and Bluestone board
MAINTAINERS: Update PowerPC 4xx tree
powerpc/44x: The bug fixed support for APM821xx SoC and Bluestone board
powerpc: document the FSL MPIC message register binding
powerpc: add support for MPIC message register API
powerpc/fsl: Added aliased MSIIR register address to MSI node in dts
powerpc/85xx: mpc8548cds - add 36-bit dts
...
We are going to remove the EOFBLOCKS_FL flag in the future, so this is
the first part of the removal. We can not remove it entirely just now,
since the e2fsck is still checking for it and it might cause headache to
some people. Instead, remove the restrictive checks now and the rest
later, when the new e2fsck code is out and common enough.
This is also needed because punch hole already breaks the EOFBLOCKS_FL
semantics, so it might cause the some troubles. So simply remove it.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Currently if the range to trim is too small, for example on 1K fs
the request to trim the first block, then the 'range->len' is not set
reporting wrong number of discarded block to the caller.
Fix this by always setting the 'range->len' before we return. Note that
when there is a failure (-EINVAL) caller can not depend on 'range->len'
being set properly.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Currently when there is not enough free blocks in the block group to
discard (grp->bb_free < minlen) the 'trimmed' is bumped up anyway with
the number of discarded blocks from the previous iteration. Fix this
by bumping up 'trimmed' only if the ext4_trim_all_free() was actually
run.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The overflow can happen when we are calling get_group_no_and_offset()
which stores the group number in the ext4_grpblk_t type which is
actually int. However when the blocknr is big enough the group number
might be bigger than ext4_grpblk_t resulting in overflow. This will
most likely happen with FITRIM default argument len = ULLONG_MAX.
Fix this by using "end" variable instead of "start+len" as it is easier
to get right and specifically check that the end is not beyond the end
of the file system, so we are sure that the result of
get_group_no_and_offset() will not overflow. Otherwise truncate it to
the size of the file system.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Pull m68knommu arch updates from Greg Ungerer:
"Includes a cleanup of the non-MMU linker script (it now almost
exclusively uses the well defined linker script support macros and
definitions). Some more merging of MMU and non-MMU common files
(specifically the arch process.c, ptrace and time.c). And a big
cleanup of the massively duplicated ColdFire device definition code.
Overall we remove about 2000 lines of code, and end up with a single
set of platform device definitions for the serial ports, ethernet
ports and QSPI ports common in most ColdFire SoCs.
I expect you will get a merge conflict on arch/m68k/kernel/process.c,
in cpu_idle(). It should be relatively strait forward to fixup."
And cpu_idle() conflict resolution was indeed trivial (merging the
nommu/mmu versions of process.c trivially conflicting with the
conversion to use the schedule_preempt_disabled() helper function)
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu: (57 commits)
m68knommu: factor more common ColdFire cpu reset code
m68knommu: make 528x CPU reset register addressing consistent
m68knommu: make 527x CPU reset register addressing consistent
m68knommu: make 523x CPU reset register addressing consistent
m68knommu: factor some common ColdFire cpu reset code
m68knommu: move old ColdFire timers init from CPU init to timers code
m68knommu: clean up init code in ColdFire 532x startup
m68knommu: clean up init code in ColdFire 528x startup
m68knommu: clean up init code in ColdFire 523x startup
m68knommu: merge common ColdFire QSPI platform setup code
m68knommu: make 532x QSPI platform addressing consistent
m68knommu: make 528x QSPI platform addressing consistent
m68knommu: make 527x QSPI platform addressing consistent
m68knommu: make 5249 QSPI platform addressing consistent
m68knommu: make 523x QSPI platform addressing consistent
m68knommu: make 520x QSPI platform addressing consistent
m68knommu: merge common ColdFire FEC platform setup code
m68knommu: make 532x FEC platform addressing consistent
m68knommu: make 528x FEC platform addressing consistent
m68knommu: make 527x FEC platform addressing consistent
...
This caused conflict with camellia-x86_64 when compiled into kernel, same
function names and not static.
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This caused conflict with twofish-x86_64-3way when compiled into kernel,
same function names and not static.
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Pull gfs2 changes from Steven Whitehouse.
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
GFS2: Change truncate page allocation to be GFP_NOFS
GFS2: call gfs2_write_alloc_required for each chunk
GFS2: Clean up log flush header writing
GFS2: Remove a __GFP_NOFAIL allocation
GFS2: Flush pending glock work when evicting an inode
GFS2: make sure rgrps are up to date in func gfs2_blk2rgrpd
GFS2: Eliminate sd_rindex_mutex
GFS2: Unlock rindex mutex on glock error
GFS2: Make bd_cmp() static
GFS2: Sort the ordered write list
GFS2: FITRIM ioctl support
GFS2: Move two functions from log.c to lops.c
GFS2: glock statistics gathering
Currently we can't do task migration among memory cgroups without THP
split, which means processes heavily using THP experience large overhead
in task migration. This patch introduces the code for moving charge of
THP and makes THP more valuable.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These macros will be used in a later patch, where all usages are expected
to be optimized away without #ifdef CONFIG_TRANSPARENT_HUGEPAGE. But to
detect unexpected usages, we convert the existing BUG() to BUILD_BUG().
[akpm@linux-foundation.org: fix build in mm/pgtable-generic.c]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Replace lengthy function name is_target_pte_for_mc() with a shorter
one in order to avoid ugly line breaks.
- explicitly use MC_TARGET_* instead of simply using integers.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the following code:
if (type == _MEM)
thresholds = &memcg->thresholds;
else if (type == _MEMSWAP)
thresholds = &memcg->memsw_thresholds;
else
BUG();
BUG_ON(!thresholds);
The BUG_ON() seems redundant.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mem_cgroup_begin_update_page_stat() should be very fast because it's
called very frequently. Now, it needs to look up page_cgroup and its
memcg....this is slow.
This patch adds a global variable to check "any memcg is moving or not".
With this, the caller doesn't need to visit page_cgroup and memcg.
Here is a test result. A test program makes page faults onto a file,
MAP_SHARED and makes each page's page_mapcount(page) > 1, and free the
range by madvise() and page fault again. This program causes 26214400
times of page fault onto a file(size was 1G.) and shows shows the cost of
mem_cgroup_begin_update_page_stat().
Before this patch for mem_cgroup_begin_update_page_stat()
[kamezawa@bluextal test]$ time ./mmap 1G
real 0m21.765s
user 0m5.999s
sys 0m15.434s
27.46% mmap mmap [.] reader
21.15% mmap [kernel.kallsyms] [k] page_fault
9.17% mmap [kernel.kallsyms] [k] filemap_fault
2.96% mmap [kernel.kallsyms] [k] __do_fault
2.83% mmap [kernel.kallsyms] [k] __mem_cgroup_begin_update_page_stat
After this patch
[root@bluextal test]# time ./mmap 1G
real 0m21.373s
user 0m6.113s
sys 0m15.016s
In usual path, calls to __mem_cgroup_begin_update_page_stat() goes away.
Note: we may be able to remove this optimization in future if
we can get pointer to memcg directly from struct page.
[akpm@linux-foundation.org: don't return a void]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Greg Thelen <gthelen@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the new lock scheme for updating memcg's page stat, we don't need a
flag PCG_FILE_MAPPED which was duplicated information of page_mapped().
[hughd@google.com: cosmetic fix]
[hughd@google.com: add comment to MEM_CGROUP_CHARGE_TYPE_MAPPED case in __mem_cgroup_uncharge_common()]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Greg Thelen <gthelen@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, page-stat-per-memcg is recorded into per page_cgroup flag by
duplicating page's status into the flag. The reason is that memcg has a
feature to move a page from a group to another group and we have race
between "move" and "page stat accounting",
Under current logic, assume CPU-A and CPU-B. CPU-A does "move" and CPU-B
does "page stat accounting".
When CPU-A goes 1st,
CPU-A CPU-B
update "struct page" info.
move_lock_mem_cgroup(memcg)
see pc->flags
copy page stat to new group
overwrite pc->mem_cgroup.
move_unlock_mem_cgroup(memcg)
move_lock_mem_cgroup(mem)
set pc->flags
update page stat accounting
move_unlock_mem_cgroup(mem)
stat accounting is guarded by move_lock_mem_cgroup() and "move" logic
(CPU-A) doesn't see changes in "struct page" information.
But it's costly to have the same information both in 'struct page' and
'struct page_cgroup'. And, there is a potential problem.
For example, assume we have PG_dirty accounting in memcg.
PG_..is a flag for struct page.
PCG_ is a flag for struct page_cgroup.
(This is just an example. The same problem can be found in any
kind of page stat accounting.)
CPU-A CPU-B
TestSet PG_dirty
(delay) TestClear PG_dirty
if (TestClear(PCG_dirty))
memcg->nr_dirty--
if (TestSet(PCG_dirty))
memcg->nr_dirty++
Here, memcg->nr_dirty = +1, this is wrong. This race was reported by Greg
Thelen <gthelen@google.com>. Now, only FILE_MAPPED is supported but
fortunately, it's serialized by page table lock and this is not real bug,
_now_,
If this potential problem is caused by having duplicated information in
struct page and struct page_cgroup, we may be able to fix this by using
original 'struct page' information. But we'll have a problem in "move
account"
Assume we use only PG_dirty.
CPU-A CPU-B
TestSet PG_dirty
(delay) move_lock_mem_cgroup()
if (PageDirty(page))
new_memcg->nr_dirty++
pc->mem_cgroup = new_memcg;
move_unlock_mem_cgroup()
move_lock_mem_cgroup()
memcg = pc->mem_cgroup
new_memcg->nr_dirty++
accounting information may be double-counted. This was original reason to
have PCG_xxx flags but it seems PCG_xxx has another problem.
I think we need a bigger lock as
move_lock_mem_cgroup(page)
TestSetPageDirty(page)
update page stats (without any checks)
move_unlock_mem_cgroup(page)
This fixes both of problems and we don't have to duplicate page flag into
page_cgroup. Please note: move_lock_mem_cgroup() is held only when there
are possibility of "account move" under the system. So, in most path,
status update will go without atomic locks.
This patch introduces mem_cgroup_begin_update_page_stat() and
mem_cgroup_end_update_page_stat() both should be called at modifying
'struct page' information if memcg takes care of it. as
mem_cgroup_begin_update_page_stat()
modify page information
mem_cgroup_update_page_stat()
=> never check any 'struct page' info, just update counters.
mem_cgroup_end_update_page_stat().
This patch is slow because we need to call begin_update_page_stat()/
end_update_page_stat() regardless of accounted will be changed or not. A
following patch adds an easy optimization and reduces the cost.
[akpm@linux-foundation.org: s/lock/locked/]
[hughd@google.com: fix deadlock by avoiding stat lock when anon]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Greg Thelen <gthelen@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PCG_MOVE_LOCK is used for bit spinlock to avoid race between overwriting
pc->mem_cgroup and page statistics accounting per memcg. This lock helps
to avoid the race but the race is very rare because moving tasks between
cgroup is not a usual job. So, it seems using 1bit per page is too
costly.
This patch changes this lock as per-memcg spinlock and removes
PCG_MOVE_LOCK.
If smaller lock is required, we'll be able to add some hashes but I'd like
to start from this.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>